Skip to content

Instantly share code, notes, and snippets.

@dbousamra
Created January 17, 2019 23:06
Show Gist options
  • Star 3 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save dbousamra/f662f381d33fcf5c4a5475c4a656fa19 to your computer and use it in GitHub Desktop.
Save dbousamra/f662f381d33fcf5c4a5475c4a656fa19 to your computer and use it in GitHub Desktop.
author title
Lionel Flandrin
Playstation Emulation Guide

Introduction

This is my attempt at documenting my implementation of a PlayStation emulator from scratch. I'll write the document as I go and I'll try to explain as much as possible along the way. You can find the complete source of the emulator itself in my GitHub repository.

Since my favourite passtime is to reinvent the wheel and recode things that already exist I decided that this time I might as well document it. This way maybe this time something useful will come out of it and it'll give me a motivation to finish it.

I will be using the Rust programming language but this is not meant as a Rust tutorial and knowledge of the language shouldn't be necessary to follow this guide, although it won't hurt.

Isn't emulation complicated?

Emulation requires some low-level knowledge about how computers work and some basics in electronics might help for certain things. Since this doc is meant as an introduction to emulation I'll assume that the reader doesn't bring anything with them beyond some decent programming skills. So don't worry if you're not familiar with registers, cache, memory mapped IO, virtual memory, interrupts and other low level fun: I'll try to explain everything when needed. Emulators are a good introduction to low level programming without having to bother with that pesky hardware in person!

Since this is supposed to be a general guide about writing PlayStation emulators I won't put the entire source code of the emulator here, only snippets relevant to the matter beind discussed.

Finally, keep in mind that getting a PlayStation emulator even capable to run some games decently will require quite a lot of work. Don't expect to play Final Fantasy VII on your brand new emulator in two days. If you want to start with something simpler to see if you have a taste for it you can search for Chip-8, Game Boy or NES emulation tutorials (by increasing complexity).

Feedback

If some part of this document is unclear, poorly written or incomplete please submit an issue so that I can fix or complete it. Corrections for grammar, syntax and typos are very welcome. Thank you!

Ready? Let's begin!

The CPU: Instructions and the memory

What is a CPU, anyway?

That might seem like a silly question to some but I'm sure there are plenty of competent programmers out there who are used to program in high level managed environements haven't seen a register in their entire life. Let me make the introductions.

For our first version of the PlayStation CPU I'm going to make some simplifying assumptions. I'm going to ignore the caches for instance and assume that it directly accesses the system bus. Basically we're going to implement a Von Neumann architecture.As we make progress we'll have to revisit this design to add the missing bits when they are needed.

The objective of this section is to implement all the instructions and try to reach the part of the BIOS where it starts to draw on the screen. As we'll see there's a bunch of boring initialization code to run before we get there.

There are 67 opcodes in the Playstation MIPS CPU. Some take one line to implement, others will give us more trouble. In order to make the process more interactive and less tedious we'll implement them as they're encountered while we're running the original BIOS code. This way we'll immediately be able to see our emulator in action.

But first things first, before we start implementing instructions we need to explain how a CPU works.

Architecture

A simple Von Neumann architecture looks like this: the CPU only sees a flat address space: an array of bytes. The PlayStation uses 32bit addresses so the CPU sees 1 << 32 addresses. In other words it can address 4GB of memory. That's why the PlayStation is said to be a 32bit console (that and the fact that it uses 32bit registers in the CPU as we'll see in a minute).

This address space contains all the external ressources the CPU can access: the RAM of course but also the various peripherals (GPU, controllers, CD drive, BIOS...). That's called memory mapped IO. Note that in this context "memory" doesn't mean RAM. Rather it means that you access peripherals as if they were memory (instead of using dedicated instructions for instance). From the point of view of the CPU, everything is just a big array of bytes and it doesn't really know what's out there.

Of course we'll have to figure out how the devices and RAM are mapped in this address space to make sure the transactions end up at the right location when the CPU starts reading and writing to the bus. But first we need to understand how the code is executed.

The code

In this architecture the instructions live in the global address space along with everything else. Typically in RAM but again, the CPU doesn't care. If you want to run code from the controller input port I'm sure the console will let you. Probably not very useful but it's all the same as far as the CPU is concerned.

So somewhere in this 4GB address space there's the next instruction for the CPU to run. How does it know the address of this instruction? By using a register of course!

The Program Counter register

Registers are very small and very fast special purpose memories built inside the CPU. Most CPU instructions manipulate those registers by adding them, multiplying them, masking them, storing their content to memory or fetching it back...

The Program Counter (henceforth refered to as PC) is one of the most elementary registers, it exists in one form or an other on basically all computer architectures (although it goes by various names, on x86 for instance it's called the Instruction Pointer, IP). Its job is simply to hold the address of the next instruction to be run.

As we've seen, the PlayStation uses 32bit addresses, so the PC register is 32bit wide (as are all other CPU registers for that matter).

A typical CPU execution cycle goes roughly like this:

  1. Fetch the instruction located at address PC,

  2. Increment the PC to point to the next instruction,

  3. Execute the instruction,

  4. Repeat

We need to know how big an instruction is in order to know how many bytes to fetch and how much we need to increment the PC to point at the next instruction. Some architectures have variable length instructions (x86 and derivatives are a common example) which means we'd have to decode the instruction to know how many bytes it takes. Fortunately for us, the PlayStation uses a fixed length instruction set (The MIPS instruction set) and all instructions are 32bit long.

With all that in mind we can finally start writing some code!

Here's what the CPU state looks like at that point:

/// CPU state
pub struct Cpu {
    /// The program counter register
    pc: u32,
}

And here's the implementation of our CPU cycle described above:

impl Cpu {

    pub fn run_next_instruction(&mut self) {
        let pc = self.pc;

        // Fetch instruction at PC
        let instruction = self.load32(pc);

        // Increment PC to point to the next instruction.
        self.pc = pc.wrapping_add(4);

        self.decode_and_execute(instruction);
    }
}

In Rust wrapping_add means that we want the PC to wrap back to 0 in case of an overflow (i.e. 0xfffffffc + 4 => 0x00000000). We'll see that most CPU operations wrap on overflow (although some instructions catch those overflows and generate an exception, we'll see that later).

If you're coding in C you don't need to worry about that if you use uint32_t since the C standard mandates that unsigned overflow wraps around in this fashion. Rust however says that overflows are undefined and will generate an error in debug builds if an unchecked overflow is detected, that's why I need to write pc.wrapping_add(4) instead of pc + 4.

We now finally have some code but it doesn't build yet.

We're still missing 3 pieces of the puzzle before we can run this piece of code:

  • What's the initial value of PC when starting up?

  • How do we implement the fetch32 function?

  • How do we implement the decode_and_execute function?

Reset value of the PC

In integrated circuits reset is a state where the chip generally does nothing and its internal state is set to some known default "factory" value. What exactly the reset does varies from chip to chip (it's just a convention) but it's assumed that a chip will restart in a clean and deterministic state after a reset cycle.

Generally the reset is a dedicated pin on the chip that's connected to a button or some other control logic. Sometimes you can also request a "soft" reset through software using a specific command or sequence of instructions. Reseting a chip does necessitate cutting off the power (nor is power cycling an integrated circuit a good way to reset a chip: if the reset signal is not asserted it might not load the default values correctly).

When you power up the console or hit the reset button the hardware forces the CPU (and other peripherals) into a reset state to initialize the logic.

Knowing this it's pretty obvious that the reset value of the PC is very important since it's going to tell the CPU where it should start running the code. It basically defines the location of the "main" function of the console's kernel.

The docs say that the reset value of PC is 0xbfc00000. In the playstation memory map that's the beginning of the BIOS (we'll look at the memory map in greater details in the next section).

Now that we know where our story starts we can write our CPU initializer:

impl Cpu {

    pub fn new() -> Cpu {
        Cpu {
            // PC reset value at the beginning of the BIOS
            pc: 0xbfc00000,
        }
    }

    //...
}

The Playstation memory map

Our CPU treats all addresses the same way but at some point we'll have to dispatch the load/store requests to the correct peripheral. If we read the BIOS and we get GPU data instead we're going to run into troubles very quickly...

So how do we know what is mapped at some arbitrary address? By using the memory map of course!

Here's an overview of the PlayStation memory map, courtesy of the Nocash specs:

KUSEG KSEG0 KSEG1 Length Description


0x00000000 0x80000000 0xa0000000 2048K Main RAM 0x1f000000 0x9f000000 0xbf000000 8192K Expansion Region 1 0x1f800000 0x9f800000 0xbf800000 1K Scratchpad 0x1f801000 0x9f801000 0xbf801000 8K Hardware registers 0x1fc00000 0x9fc00000 0xbfc00000 512K BIOS ROM

: Playstation memory map[]{label="tab:mmap"}

Let's take the time to parse through this.

We can see that most peripherals in table [tab:mmap]{reference-type="ref" reference="tab:mmap"} are mapped at several addresses. For instance if we look at the PC reset value 0xbfc00000 corresponds to the beginning of the BIOS range in region KSEG1. However we can also reach the same location through addresses 0x1fc00000(KUSEG) and 0x9fc00000(KSEG0).

What's the point of having those mirrored regions? What's the difference between KUSEG and KSEG1 for instance? Those are memory regions which are used to specify certain attributes of the memory access. On the Playstation hardware it's mostly used to specify whether the access is cached or not.

For now we're going to ignore regions and treat all mappings the same, we'll study them more closely later on.

KSEG2 Length Description


0xfffe0000 512B I/O Ports

: KSEG2 memory map[]{label="tab:kseg2"}

Table [tab:kseg2]{reference-type="ref" reference="tab:kseg2"} shows the last region: KSEG2. It's a bit different from the others. It doesn't mirror the other regions, instead it gives access to a unique set of registers. As far as I know the only important register there is the cache control but there might be others I haven't encountered yet.

Implementing the memory map

In order to implement the PlayStation memory map in our emulator we will need an interconnect to dispatch the load/store operations to the correct peripheral.

I don't know if the PlayStation really has a hardware interconnect. The CPU could just "broadcast" the read/write operations on the system bus and the peripherals would check the address and only answer if it's for them. However this design would be inefficient in software: we'd need to iterate over the peripherals for each transaction until we find the correct receiver.

Instead we're just going to implement a "switchboard" that will match the address to the correct peripheral and forward it there.

Since the first thing the emulator will run is the BIOS we'll use it as our first peripheral.

The BIOS

On the PlayStation the BIOS displays the first screens (with the logos and that memorable sweeping tune) and starts the game from the CD drive. If no CD is present it displays a menu that can be used to manage the memory cards and play CDs. As a player that's probably the only time you'd know there was a BIOS running.

But that's just the tip of the iceberg! The BIOS remains loaded at all time and provides a Basic Input/Output System to the running game. That means that the game can call into the BIOS to do things like allocating memory, reading the memory card, common libc functions (qsort, memset...) and many other things.

We won't be implementing the BIOS ourselves. It's possible (and it's been done) but that's a lot of work and probably something you'd want to do once you have a working emulator. It might also hurt compatibility since many games are known to patch the BIOS at runtime. The Nocash specs have more info.

We could dump the BIOS of a console but that requires access to the actual hardware and the know-how to access the BIOS memory. Fortunately some nice people have done it for us and these days it's easy to find BIOS files on the web.

There are many BIOS versions: they change depending on the region, the hardware revision and patches. Any good dump should work (after all, they all do more or less the same thing) but if you're following this guide it's probably better that we use the same file.

Algorithm Hash


MD5 924e392ed05558ffdb115408c263dccf SHA-1 10155d8d6e6e832d6ea66db9bc098321fb5e8ebf

: SCPH1001.BIN BIOS checksums[]{label="tab:checksums"}

I've decided to go for the version named SCPH1001.BIN. The file should be exactly 512KB big. Check table [tab:checksums]{reference-type="ref" reference="tab:checksums"} to make sure you got the right one.

Loading the BIOS

Once we got our BIOS the rest is pretty straightforward. We just read the file into a 512KB buffer:

/// BIOS image
pub struct Bios {
    /// BIOS memory
    data: Vec<u8>
}

impl Bios {

    /// Load a BIOS image from the file located at `path`
    pub fn new(path: &Path) -> Result<Bios> {

        let file = try!(File::open(path));

        let mut data = Vec::new();

        // Load the BIOS
        try!(file.take(BIOS_SIZE).read_to_end(&mut data));

        if data.len() == BIOS_SIZE as usize {
            Ok(Bios { data: data })
        } else {
            Err(Error::new(ErrorKind::InvalidInput,
                           "Invalid BIOS size"))
        }
    }
}

/// BIOS images are always 512KB in length
const BIOS_SIZE: u64 = 512 * 1024;

We also need to be able to read data from the BIOS. The CPU wants to read 32bit of data to load the instructions so let's start by implementing load32:

impl Bios {
    // ...

    /// Fetch the 32bit little endian word at `offset`
    pub fn load32(&self, offset: u32) -> u32 {
        let offset = offset as usize;

        let b0 = self.data[offset + 0] as u32;
        let b1 = self.data[offset + 1] as u32;
        let b2 = self.data[offset + 2] as u32;
        let b3 = self.data[offset + 3] as u32;

        b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)
    }
}

A few things to note: offset, as its name implies, is not the absolute address used by the CPU, it's just the offset in the BIOS memory range. Remember that the BIOS is mapped in multiple regions so we'll handle that in the generic interconnect code. Each peripheral will just handle offsets in its address range.

In the comment I mention that we read the word in little endian. That's important. If you've never had to worry about endianess issues before let me give you the gist.

The basic unit of memory is a byte (8 bits in our case). You cannot address anything smaller than that. However sometimes you need to store data over multiple bytes. For instance we've seen that our instructions are 4byte long. We have multiple way to store 4byte words in our "array of bytes".

Let's take an example: you have the 32bit word 0x12345678. You have multiple way to store that value in 4 consecutive bytes. We can store [0x12, 0x34, 0x56, 0x78] or [0x78, 0x56, 0x34, 0x12] for instance. The former is called big-endian because we store the most significant byte first. The latter is little-endian because we store the least significant byte first. There are other endian types with weirder patterns but they're not often used is modern computers. Check wikipedia if you want more details.

The PlayStation is little-endian so we're in the 2nd case: when reading or writing multi-byte values the least significiant byte goes first. If we do it the other way around we'll end up with garbage.

Now we can implement our interconnect to let the CPU communicate with the BIOS.

The interconnect

We now have an embryo of a CPU and our first device ready to talk to each other. We just need to figure out how to link them together.

At that point we could have the CPU talk directly to the BIOS, after all it's our only device. Obviously that won't work for very long however, we need to be able to dispatch the CPU's loads and stores to the correct peripheral depending on the address range.

I'm not quite sure how this is handled on the actual hardware. For simple buses it's very possible that the CPU just "broadcasts" the address to all the peripherals and each of them just checks if it's within their address range and simply ignores the transaction if they see it's not for them. It's fast in hardware because all peripherals work in parallel so there's no delay induced: they can all receive and decode the address at the same moment.

Unfortunately we can't really do that in software: the closest equivalent would be to spawn a thread for each peripheral. The problem is that memory transactions are very common (several millions per second potentially) and having to send data and resynchronize across threads would kill our performances.

Multihreading emulators in general is a very tough issue: for threading to be really efficient you need to reduce data exchange and resynchronization as much as possible to let each thread live its life. When we emulate however we want to mimick the original hardware behaviour and speed as much as possible which requires very frequent resynchronization and we have plenty of shared state. The two endeavors are somewhat at odds. That's not to say multithreading is impossible in emulators, just that it's hard. We can't just spawn threads willy-nilly.

Anyway, back to our interconnect: since threads are out it means we'll have to sequentially match the address against each mapping until we get a match. Then we can let the selected peripheral handle the transaction.

Let's do just that:

/// Global interconnect
pub struct Interconnect {
    /// Basic Input/Output memory
    bios: Bios,
}

impl Interconnect {
    pub fn new(bios: Bios) -> Interconnect {
        Interconnect {
            bios: bios,
        }
    }
}

I've decided to store the BIOS directly in the interconnect struct. We'll append the other peripherals there as we implement them. We are going to store the interconnect inside the struct Cpu which will give us a device tree with the CPU at the top. It makes the data paths pretty simple: everything goes from the CPU to the peripherals. It's easier to reason about than a full "everybody sees everybody" architecture in my opinion but it might prove limiting as we progress. We'll see if we need to revise that later.

Now we can finally implement the load32 function that the CPU will be using. I don't like having hardcoded constants all over the place so I'm going to tie the address ranges to nice symbolic names:

mod map {
    struct Range(u32, u32);

    impl Range {
        /// Return `Some(offset)` if addr is contained in `self`
        pub fn contains(self, addr: u32) -> Option<u32> {
            let Range(start, length) = self;

            if addr >= start && addr < start + length {
                Some(addr - start)
            } else {
                None
            }
        }
    }

    pub const BIOS: Range = Range(0xbfc00000, 512 * 1024);
}

If you're not familiar with rust what this does is create a new type Range which is a tuple of two values: the start address and length of the mapping.

I also declare a contains methods which takes an address and returns Some(offset) if the address is within the range, None otherwise. You can think of it as a form of multiple return values with some nice type-safety on top.

Finally I declare our first range for the BIOS.

Now for the load32 function:

impl Interconnect {
    //...

    /// Load 32bit word at `addr`
    pub fn load32(&self, addr: u32) -> u32 {

        if let Some(offset) = map::BIOS.contains(addr) {
            return self.bios.load32(offset);
        }

        panic!("unhandled fetch32 at address {:08x}", addr);
    }
}

The if let syntax is an other rust nicety: if the contains function returns Some(offset) we enter the body of the if with offset bound to a temporary variable. If contains returns None on the other hand the if is refuted and we don't enter the body and go straight to the panic! command which will make our emulator crash.

Gluing the interconnect to the CPU

The only thing left before we can finally build our code is gluing the Interconnect with the Cpu.

We add an inter member to the struct Cpu and take an Interconnect object in the constructor:

/// CPU state
pub struct Cpu {
    /// The program counter register
    pc: u32,
    /// Memory interface
    inter: Interconnect,
}

impl Cpu {

    pub fn new(inter: Interconnect) -> Cpu {
        Cpu {
            // PC reset value at the beginning of the BIOS
            pc: 0xbfc00000,
            inter: inter,
        }
    }

    // ...
}

We can also implement the load32 function for the CPU which will just call the interconnect.

impl Cpu {
    //...

    /// Load 32bit value from the interconnect
    fn load32(&self, addr: u32) -> u32 {
        self.inter.load32(addr)
    }
}

We're still lacking the decode_and_execute function, let's use a placeholder function that just panics for now:

impl Cpu {
    //...

    fn decode_and_execute(&mut self, instruction: u32) {
        panic!("Unhandled instruction {:08x}",
               instruction);
    }
}

Finally we can instantiate everything in our main function:

fn main() {
    let bios = Bios::new(&Path::new("roms/SCPH1001.BIN")).unwrap();

    let inter = Interconnect::new(bios);

    let mut cpu = Cpu::new(inter);

    loop {
        cpu.run_next_instruction();
    }
}

I've hardcoded the BIOS path for now. It would be better to read it from the command line, a config file or even some fancy dialog window but it'll do nicely for now.

We should now be able to build the code. When I run it, assuming that the BIOS file was found at the correct location I get:

thread `<main>' panicked at 'Unhandled instruction 3c080013'

As expected the decode_and_execute function died on us but we managed to fetch an instruction. If you've been using the same BIOS file as me you should have exactly the same value of 0x3c080013. If you got an other value something is wrong with your code. In particular if you end up with 0x1300083c it means you're erroneously reading in big-endian.

Instruction decoding

We've now fetched our first instruction from the BIOS: 0x3c080013. What do we do with this?

In order to be able to run this instruction we need to decode it to figure out what it means. Instruction encoding is of course CPU dependent so we need to interpret this value in the context of the Playstation R3000 processor. Once again the Nocash specs have our back and list the format of the instruction. MIPS is a common architecture used outside of the playstation and you can find plenty of resources online describing its instruction set.

Let's decode this one by hand to see how this works. If we look at the "Opcode/Parameter Encoding" table in Nocash's docs we see that we need to look at the bits [31:26] of the operation to see what type it is. In our case they are 001111. That means the operation is a LUI or "Load Upper Immediate". Immediate means that the value loaded is directly in the instruction, not indirectly somewhere else in memory. Upper means that it's loading this immediate value into the high 16 bits of the target register. The 16 low bits are cleared (set to 0).

But what are the register and the value used by the instruction? Well we need to finish decoding it to figure it out: for a LUI bits [20:16] give us the target register: in our case it's 01000 which means it's register 8. Finally bits [15:0] contain the immediate value: 0000 0000 0001 0011 or 19 in decimal. Bits [25:21] are not used and their value doesn't matter.

In other words this instruction puts 0x13 in the 16 high bits of the register 8. In MIPS assembly1 it would be equivalent to:

lui $8, 0x13

Enough babbling, let's implement decoding. First I'll wrap the raw instruction in a nice interface that will let us extract the fields without doing the bitshifts and masking everywhere. If you look at the encoding for other MIPS instructions you'll see that it's fairly regular, for instance immediate values are always stored in the LSBs:

struct Instruction(u32);

impl Instruction {
    /// Return bits [31:26] of the instruction
    fn function(self) -> u32 {
        let Instruction(op) = self;

        op >> 26
    }

    /// Return register index in bits [20:16]
    fn t(self) -> u32 {
        let Instruction(op) = self;

        (op >> 16) & 0x1f
    }

    /// Return immediate value in bits [16:0]
    fn imm(self) -> u32 {
        let Instruction(op) = self;

        op & 0xffff
    }
}

The names for the accessor functions match those I've seen used in the various references to name the various fields.

We can now leverage that fancy interface in decode_and_execute:

impl Cpu {
     // ...

    fn decode_and_execute(&mut self, instruction: Instruction) {
        match instruction.function() {
            0b001111 => self.op_lui(instruction),
            _        => panic!("Unhandled instruction {:x}", instruction.0),
        }
    }

    /// Load Upper Immediate
    fn op_lui(&mut self, instruction: Instruction) {
        let i = instruction.imm();
        let t = instruction.t();

        panic!("what now?");
    }
}

We're very close to finally run our first instruction in full but we're still missing something: we see that the register field in this instruction is 5bits, that means it can index 32 registers. But for now we only have one register in our CPU: the PC. We need to introduce the rest of them.

General purpose registers

Register Name Conventional use


$0 $zero Always zero $1 $at Assembler temporary $2, $3 $v0, $v1 Function return values $4...$7 $a0...$a3 Function arguments $8...$15 $t0...$t7 Temporary registers $16...$23 $s0...$s7 Saved registers $24, $25 $t8, $t9 Temporary registers $26, $27 $k0, $k1 Kernel reserved registers $28 $gp Global pointer $29 $sp Stack pointer $30 $fp Frame pointer $31 $ra Function return address

: R3000 CPU general purpose registers[]{label="tab:cpuregs"}

Table [tab:cpuregs]{reference-type="ref" reference="tab:cpuregs"} lists the registers in the Playstation MIPS R3000 CPU (ignoring the coprocessors for now). They're all 32bit wide.

You can see that we have 32 registers ($0 to $31) which are the general purpose registers. They're all given a mnemonic used when writing assembly. For instance, by convention, $29 is the stack pointer($sp) and $30 holds the frame pointer ($fp).

It's important to understand that those are just a convention between developers, in the hardware there's no difference between $29 and $30. The point of those calling conventions is to make it possible to make code generated from different compilers or written in assembly by different coders remain interoperable. If you write MIPS assembly and want to call third party functions (like the BIOS functions for instance) you'll have to adhere to this convention.

Only two general purpose registers are given a special meaning by the hardware itself: $zero and $ra.

The $zero register

$zero ($0) is always equal to 0. If an instruction attempts to load a value in this register it doesn't do anything, the register will still be 0 afterwards.

Having a constant 0 register is useful to reduce the size of the instruction set. For instance if you want to move the value of the register $v0 in $a0 you can write this:

move $a0, $v0

However this "move" instruction is not actually part of the MIPS instruction set, it's just a convenient shorthand understood by the assembler which will generate the equivalent instruction:

addu $a0, $v0, $zero

We can see that it effectively does the same thing by setting $a0 to the result of $v0 + 0 but we avoid having to implement a dedicated "move" instruction in the CPU.

The $ra register

$ra ($31) is the other general purpose register given a special meaning by the hardware since instructions like "jump and link" or "branch and link" put the return address in this register exclusively. Therefore the following instruction jumps in function foo and puts the return address in $ra:

jal foo

As we'll soon see we don't really have to bother with the various roles assigned to those general purpose registers when writing our emulator (with the exception of $zero and $ra) but it's still useful to know the convention when trying to understand what some emulated code is doing.

Special purpose registers

Name Description


PC Program counter HI high 32bits of multiplication result; remainder of division LO low 32bits of multiplication result; quotient of division

: R3000 CPU special purpose registers[]{label="tab:specialcpuregs"}

Table [tab:specialcpuregs]{reference-type="ref" reference="tab:specialcpuregs"} lists the three special purpose CPU registers. We're already familiar with the PC used to keep track of the code execution. The two others are HI and LO which contain the results of multiplication and division instructions. Those cannot be used as general purpose registers, instead there are special instructions used to manipulate them. We'll discover them as we implement them.

Implementing the general purpose registers

I'm just going to represent the 32 general purpose registers as an array of 32 u32 and use the index in the instructions to address them. I'll even have an entry for $zero even though it's always 0 to avoid special cases. Of course we'll have to be careful to always keep its value to 0.

/// CPU state
pub struct Cpu {
    /// The program counter register
    pc: u32,
    /// General Purpose Registers.
    /// The first entry must always contain 0.
    regs: [u32; 32],
    /// Memory interface
    inter: Interconnect,
}

The registers are not initialized on reset, so they contain garbage value when we start up. For the sake of our emulator being deterministic I won't actually put random values in the registers however, instead I'm going to use an arbitrary garbage value 0xdeadbeef. We could as well initialize them to 0 but I prefer to use a more distinguishable value which can be helpful while debugging. We must remember to put 0 in $zero however.

impl Cpu {
    pub fn new(inter: Interconnect) -> Cpu {
        // Not sure what the reset values are...
        let mut regs = [0xdeadbeef; 32];

        // ... but R0 is hardwired to 0
        regs[0] = 0;

        Cpu {
            // PC reset value at the beginning of the BIOS
            pc: 0xbfc00000,
            regs: regs,
            inter: inter,
        }
    }

    fn reg(&self, index: u32) -> u32 {
        self.regs[index as usize]
    }

    fn set_reg(&mut self, index: u32, val: u32) {
        self.regs[index as usize] = val;

        // Make sure R0 is always 0
        self.regs[0] = 0;
    }

    // ...
}

I've also added a getter and a setter. They're very straightforward but I take care to always write 0 in $zero in case it gets overwritten. I don't ever bother checking if the function wrote in this register or an other one, writing a 32bit value is cheap and probably cheaper than adding an if. It's also important to note that the BIOS does try to write to $zero, it is believed that this is useful to discard an I/O result without having to waste a register.

LUI instruction

Now we can finally implement our first instruction in full! Here's what op_lui looks like now:

impl Cpu {

    // ...

    /// Load Upper Immediate
    fn op_lui(&mut self, instruction: Instruction) {
        let i = instruction.imm();
        let t = instruction.t();

        // Low 16bits are set to 0
        let v = i << 16;

        self.set_reg(t, v);
    }
}

Note that the low 16bits are set to 0. It's important as we'll see with the next instruction.

The first instruction in the BIOS uses LUI to put 0x13 in the high 16bits of $8.

ORI instruction

We can directly implement the 2nd instruction: 0x3508243f.

It decodes to:

ori $8, $8, 0x243f

In other words, it puts the result of the bitwise or of $8 and 0x243f back into $8. The previous LUI initialized the high 16bits of $8 and set the rest to 0 so this one will initialize the low 16bits.

That's the simplest way to load a constant in a register with the MIPS instruction set and that's why it's important for LUI to set the low 16bits to 0, otherwise the ORI wouldn't do the right thing.

The implementation is straightforward:

impl Cpu {
    // ...

    fn decode_and_execute(&mut self, instruction: Instruction) {
        match instruction.function() {
            0b001111 => self.op_lui(instruction),
            0b001101 => self.op_ori(instruction),
            _        => panic!("Unhandled instruction {:x}",
                               instruction.0),
        }
    }

    /// Bitwise Or Immediate
    fn op_ori(&mut self, instruction: Instruction) {
        let i = instruction.imm();
        let t = instruction.t();
        let s = instruction.s();

        let v = self.reg(s) | i;

        self.set_reg(t, v);
    }
}

After those two instructions the value of $8 should be 0x0013243f. The next instruction as an other LUI which puts 0x1f800000 in $1.

Writing to memory

The next instruction, 0xac281010, is going to give us a little more trouble. It decodes to the "store word" instruction:

sw $8, 0x1010($1)

If you're not familiar with GNU assembly syntax the 0x1010($1) syntax means "address in $1 plus offset 0x1010". In this case the full instruction is "store the 32bits in register $8 at the location $1 + 0x1010". Given the current values of the $1 and $8 registers it would store 0x0013243f at the address 0x1f801010.

We can implement the storing to memory by mirroring our load32 code:

impl Cpu {
    // ...

    /// Store 32bit value into the memory
    fn store32(&mut self, addr: u32, val: u32) {
        self.inter.store32(addr, val);
    }

    fn decode_and_execute(&mut self, instruction: Instruction) {
        match instruction.function() {
            0b001111 => self.op_lui(instruction),
            0b001101 => self.op_ori(instruction),
            0b101011 => self.op_sw(instruction),
            _        => panic!("Unhandled instruction {:x}", instruction.0),
        }
    }

    // ...

    /// Store Word
    fn op_sw(&mut self, instruction: Instruction) {
        let i = instruction.imm();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);
        let v    = self.reg(t);

        self.store32(addr, v);
    }
}

This code for op_sw is actually subtly broken, I'll explain why in a moment. For these values of addr and i it'll do the right thing though. You can see that we call into the interconnect's store32 method that we have yet to implement. Since the only peripheral we support so far is the BIOS ROM and we can't write to it there's not much we can do at that point, let's just log the access and panic:

impl Interconnect {
    // ...

    /// Store 32bit word `val` into `addr`
    pub fn store32(&mut self, addr: u32, val: u32) {
        panic!("unhandled store32 into address {:08x}", addr);
    }
}

Unaligned memory access

While we're at it I just realized that so far we allow 32bit fetch and store from and to any address. However the architecture won't allow unaligned memory accesses (i.e. 32bit accesses must have an address which is a multiple of 32bits). Many architectures don't support unaligned accesses (it generates a "bus error") and those who do usually implement it at a cost (unaligned accesses are slower). I'd rather add some code in the functions to catch unaligned access, it could help us catch unexpected behaviours when debugging:

impl Interconnect {
    //...

    /// Load 32bit word at `addr`
    pub fn load32(&self, addr: u32) -> u32 {

        if addr % 4 != 0 {
            panic!("Unaligned load32 address: {:08x}", addr);
        }

        //...
    }

    /// Store 32bit word `val` into `addr`
    pub fn store32(&mut self, addr: u32, val: u32) {

        if addr % 4 != 0 {
            panic!("Unaligned store32 address: {:08x}", addr);
        }

    //...
    }
}

Once we implement exceptions we'll be able to handle those conditions properly.

The code should now compile but unsurprisingly it won't manage to execute the SW instruction in full:

'<main>' panicked at 'unhandled store32 into address 1f801010'

The address is not part of the BIOS and therefore we don't support it yet. We can figure out where we're trying to write by going back to the memory map in table [tab:mmap]{reference-type="ref" reference="tab:mmap"}. We can see that we end up in the "Hardware registers" range.

Looking at the specs we see that registers in this range are for "memory control". They're mainly used to set things like access latencies to the various peripherals. We're going to hope we don't need to emulate those very low level settings so we'll ignore the writes to those registers for now.

Expansion mapping

There are two memory control registers we need to be careful about however: registers 0x1f801000 and 0x1f801004 contain the base address of the expansion 1 and 2 register maps. We could emulate dynamic mappings but apparently on the Playstation they're always at 0x1f000000 and 0x1f802000 respectively so we're just going to hardcode those addresses and raise an error if the BIOS or a game ever attempts to remap them to something else (which hopefully shouldn't ever happen).

impl Interconnect {
    //...

    /// Store 32bit word `val` into `addr`
    pub fn store32(&mut self, addr: u32, val: u32) {
        //...

        if let Some(offset) = map::MEM_CONTROL.contains(addr) {
            match offset {
                0 => // Expansion 1 base address
                    if val != 0x1f000000 {
                        panic!("Bad expansion 1 base address: 0x{:08x}", val);
                    },
                4 => // Expansion 2 base address
                    if val != 0x1f802000 {
                        panic!("Bad expansion 2 base address: 0x{:08x}", val);
                    },
                _ =>
                    println!("Unhandled write to MEM_CONTROL register"),
            }
            return;
        }

        panic!("unhandled store32 into address {:08x}", addr);
    }
}

And of course we need to declare the MEM_CONTROL constant:

/// Memory latency and expansion mapping
pub const MEM_CONTROL: Range = Range(0x1f801000, 36);

It's a bit hackish but at least the store will now go through.

Before we move on to the next instruction we need to address the "subtle brokenness" in our SW implementation I was talking about earlier.

Sign extension

The reason our current "Store word" extension is broken is because we're not handling the immediate value correctly. It should be interpreted like a signed 16bit value in a two's complement representation.

In other words, if the immediate value of the SW was 0xffff it would give an offset of -1, not +65535.

16bit value 32bit "unsigned" extended value decimal unsigned value


0x0000 0x00000000 0 0x0001 0x00000001 1 0x01ad 0x000001ad 429 0xffff 0x0000ffff 65535 0x83c5 0x000083c5 33733 16bit value 32bit sign-extended value decimal signed value 0x0000 0x00000000 0 0x0001 0x00000001 1 0x01ad 0x000001ad 429 0xffff 0xffffffff -1 0x83c5 0xffff83c5 -31803

: 16 to 32bit conversion: influence of sign extension[]{label="tab:signextend"}

In order to support this we don't need to add any branching, we just need to sign extend the immediate value. It means that we increase the width of the 16bit value to 32bit but instead of padding with zeroes we pad with the original MSB (which is sometimes called the sign bit). This way the signed value remains the same. See table [tab:signextend]{reference-type="ref" reference="tab:signextend"} for some examples.

You can see that for values where the sign bit is not set if we simply pad the 16 high bits with 0s we get the same result in both signed and unsigned extension. However for values with the MSB set to 1 we have a big difference. So when we extend values it's important to know if we're dealing with signed or unsigned quantities. We'll have the same problem with rightwise bitshifts: if we're shifting signed quantities we have to pad with the sign bit.

It might sounds complicated but it's very straightforward to implement with most programming languages, for instance in C, C++ and Rust simply casting from a 16bit signed integer to a 32bit integer makes the compiler sign-extend the value. If it didn't casting a 16bit variable containing -1 into a 32bit variable would have the final value be 65535 which is obviously not what we want.

We can't guess which instructions use signed or unsigned immediate values, it's described in the MIPS instruction set. For instance our ORI instruction correctly uses an unsigned immediate value.

The nice thing with two's complement representation is that while we need to think about the signedness of the value when bitshifting and widening it doesn't matter for most arithmetic operations.

For instance the 16 bit addition 0x01ad + 0x84e0 gives the same result whether the operands are signed or not: 0x01ad is 429, 0x84e0 is either 34016 if it's unsigned or -31520 if it's a two's complement signed value. 429 + 34016 is 34445 or 0x868d in hexadecimal. 429 - 31520 is -31091 or 0x868d in 16bit two's complement hexadecimal.

You can see that doing the calculation with signed or unsigned quantities doesn't matter: we end up with the same binary pattern.

Therefore we just need to care about the sign when widening the immediate from 16 to 32 bits and then we can proceed with our usual "unsigned" addition and we'll get the correct result whether the offset is negative or positive:

impl Instruction {
    // ...

    /// Return immediate value in bits [16:0] as a sign-extended 32bit
    /// value
    fn imm_se(self) -> u32 {
        let Instruction(op) = self;

        let v = (op & 0xffff) as i16;

        v as u32
    }
}

Note the order of the casts from u32 to i16 back to u32. They might look useless but that's what's forcing the compiler to generate instructions to sign-extend v.

SW instruction

We can now use this function to fix op_sw, we just have to replace instruction.imm() with the new sign-extending instruction.imm_se():

impl Cpu {
    // ...

    /// Store Word
    fn op_sw(&mut self, instruction: Instruction) {
        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);
        let v    = self.reg(t);

        self.store32(addr, v);
    }
}

This version of SW should work correctly even if the offset i is negative.

SLL instruction

The next instruction is simply 0x00000000. Looks strange but it's perfectly valid. As always we start by reading the bits [31:26] which obviously gives us 0b000000. This value however can introduce a number of instructions, to figure out which one we need to read bits [5:0] which are again full zeroes. By looking at the instruction set reference we see that these value correspond to a "shift left logical" (SLL). If we decode the entire instruction we end up with:

sll $zero, $zero, 0

Obviously this instruction does absolutely nothing since it stores the result in $zero. This instruction is just the preferred way to encode a NOP2. There are many instruction in the MIPS architecture that behave like NOPs, for instance using the opcodes we've already encountered we can craft several other NOPs:

lui $zero, 0
ori $zero, $zero, 0
ori $zero, $4, 1234

And there are many others since almost anything targeting $zero is a NOP3. I think the SLL version is preferred simply because it has this noticeable encoding of being all 0s.

In this case I can only assume that the NOP is used as a delay, probably waiting for the previous SW instructions to take effect but I'm not entirely sure why it's needed.

In our emulator we won't special-case this particular instruction, we can just implement the generic SLL instruction in full. Since NOPs are pretty common it might make some sense to special-case them but we'll need to benchmark it to make sure the cost of the test won't be greater than computing a useless shift and storing it in $zero.

Let's start by implementing the accessors (the shift immediate is only 5bits since it wouldn't make sense to shift by more than 31 places and the rest of the low bits is taken by the "subfunction" part of the instruction):

impl Instruction {
    // ...

    /// Return register index in bits [15:11]
    fn d(self) -> u32 {
        let Instruction(op) = self;

        (op >> 11) & 0x1f
    }

    /// Return bits [5:0] of the instruction
    fn subfunction(self) -> u32 {
        let Instruction(op) = self;

        op & 0x3f
    }

    /// Shift Immediate values are stored in bits [10:6]
    fn shift(self) -> u32 {
        let Instruction(op) = self;

        (op >> 6) & 0x1f
    }
}

Now that we have our fancy getters ready to parse the instruction we can implement the opcode itself:

impl Cpu {
    // ...

    fn decode_and_execute(&mut self, instruction: Instruction) {
        match instruction.function() {
            0b000000 => match instruction.subfunction() {
                0b000000 => self.op_sll(instruction),
                _        => panic!("Unhandled instruction {:08x}",
                                   instruction.0),
            },
            0b001111 => self.op_lui(instruction),
            0b001101 => self.op_ori(instruction),
            0b101011 => self.op_sw(instruction),
            _        => panic!("Unhandled instruction {:08x}",
                               instruction.0),
        }
    }

    /// Shift Left Logical
    fn op_sll(&mut self, instruction: Instruction) {
        let i = instruction.shift();
        let t = instruction.t();
        let d = instruction.d();

        let v = self.reg(t) << i;

        self.set_reg(d, v);
    }
}

Obviously in this case it won't do anything since it's a NOP but it should work correctly when we encounter a "real" SLL instruction.

ADDIU instruction

After that we encounter the instruction "0x24080b88" which is the "Add Immediate Unsigned" opcode. The name is completely misleading: it seems to say that the immediate value is treated as unsigned (i.e. not zero-extended instead of sign-extended) but it's not the case. The only difference between ADDIU and ADDI ("Add Immediate") is that the latter generates an exception on overflow while the former simply truncates the result. How they got to "unsigned" from that I have no idea...

Knowing that it's easy to implement it in our emulator4:

impl Cpu {
    // ...

    /// Add Immediate Unsigned
    fn op_addiu(&mut self, instruction: Instruction) {
        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let v = self.reg(s).wrapping_add(i);

        self.set_reg(t, v);
    }
}

If you decode the instruction in full you should end up with:

addiu $8, $zero, 0xb88

You can see an other use of the $zero register: this time with the ADDIU opcode it sets $8 to the immediate value 0xb88. It saves having a dedicated "Load immediate" opcode.

RAM configuration register

This value of 0x00000b88 is then stored at address 0x1f801060.

This register is called RAM_SIZE in the NoCash specs. The exact purpose of this register remains partially unknown but it seems to be configuring the memory controller. I assume that this controller is capable of handling various amounts of RAM for instance and this register lets the BIOS load the particular configuration needed by the Playstation hardware.

At any rate, since we're trying to emulate the Playstation and not some generic MIPS computer we probably don't have to handle this register in any specific way so it's hopefully safe to ignore it. I just add a new mapping entry, ignore the store at this address and move along:

/// Register that has something to do with RAM configuration,
/// configured by the BIOS
pub const RAM_SIZE: Range = Range(0x1f801060, 4);

After this instruction we get a few NOPs. I suppose that the ram size configuration takes a few cycle to take effect and the BIOS delays a bit before continuing.

J instruction

The next instruction is 0x0bf00054 which is a jump instruction (J). This function is used to change the value of the PC and have the CPU execution pipeline jump to some other location in memory.

Jump behaves like a goto: it sets the PC to the immediate value contained in the instruction. Since the instruction is 32bit wide and the instruction set uses 6bits to encode the opcode it can only specify 26bits of the 'PC' at once.

To make the most of those 26bits the target address is shifted two places to the right. It's not a problem because instructions must be aligned to a 32bit boundary so the two LSBs of the PC always have to be zero. It means that the instruction really encodes 28bits of the target address. The remaining 4 high bits are the PC's MSB and remain untouched. In the case of our current instruction this makes the target address 0xbfc00150.

You can see that this instruction cannot jump anywhere in RAM, only to an address within the current 256MB of addressable memory. If the CPU needs to jump further away5 it'll have to use an other instruction like JR which takes a full 32bit register containing the destination address. But we'll see that soon enough.

First we need to add an accessor for the 26bit immediate field:

impl Instruction {
    // ...

    /// Jump target stored in bits [25:0]
    fn imm_jump(self) -> u32 {
        let Instruction(op) = self;

        op & 0x3ffffff
    }
}

Now we can implement the instruction itself:

impl Cpu {
    // ...

    /// Jump
    fn op_j(&mut self, instruction: Instruction) {
        let i = instruction.imm_jump();

        self.pc = (self.pc & 0xf0000000) | (i << 2);
    }
}

Looks simple enough but unfortunately it's broken. Why you ask?

Branch delay slots {#sec:bds}

The reason our implementation of "jump" doesn't work properly is because one of the simplifying assumptions we made when we started implementing the CPU does not hold in this case.

Remember when I said that the CPU fetches and execute an instruction at each cycle, increments the PC and repeats? Well it's a bit more complicated than that.

The MIPS architecture is pipelined It means that in order to increase the throughput of the processor it splits its execution logic across several stages.

While one stage is busy decoding an instruction the instruction fetch stage could already be loading the next one. It works like an assembly line.

When the code executes linearly (i.e. without jumps or branches) there's no problem: while the CPU decodes the instruction at PC the instruction fetch stage can start loading the value at PC + 4.

But if the instruction being decoded is a jump or a branch things get messy. The instruction fetch stage cannot know that the previous instruction is supposed to change the execution path. When the instruction reaches the execution stage the value of PC gets updated, but it's too late a spurious instruction has been fetched into the pipeline already.

So there you are, with an unwanted instruction in your pipeline. What do you do?

Some architectures opt for flushing the pipeline in those cases. You restart from the correct address. Of course that's a costly operation: your CPU has to wait for the fresh instructions to make it all the way through the pipeline before getting executed. Many modern architectures do that and that's why they generally include complex branch predictors which do their best to guess if a branch is about to be taken. If they make a bad prediction the pipeline has to be flushed. That's one of the main reasons branches are considered expensive (and why I always overwrite regs[0] in set_reg instead of checking if the register was 0).

MIPS however doesn't do that. It doesn't bother wasting time flushing the pipeline, it just ignore the issues and run the code anyway. What this means is that the first instruction right after a branch always gets executed before the branch is taken, unconditionaly. This instruction is said to be in the branch delay slot

Consider the following assembly6:

j   foo
lui $a0, 0xf00

The LUI instruction gets executed before the code jumps to foo. When the function is entered $a0 will be equal to 0x0f000000.

Fortunately it's pretty easy to emulate this behaviour: we just have to do the same thing the processor does and load the next instruction before we execute the current one:

/// CPU state
pub struct Cpu {
    /// The program counter register
    pc: u32,
    /// Next instruction to be executed, used to simulate the branch
    /// delay slot
    next_instruction: Instruction,

    // ...
}

impl Cpu {

    pub fn new(inter: Interconnect) -> Cpu {
        // ...

        Cpu {
            // PC reset value at the beginning of the BIOS
            pc: 0xbfc00000,
        // ...
            next_instruction: Instruction(0x0), // NOP
        }
    }

    pub fn run_next_instruction(&mut self) {
        let pc = self.pc;

        // Use previously loaded instruction
        let instruction = self.next_instruction;

        // Fetch instruction at PC
        self.next_instruction = Instruction(self.load32(pc));

        // Increment PC to point to the next instruction. All
        // instructions are 32bit long.
        self.pc = pc + 4;

        self.decode_and_execute(instruction);
    }

    // ...
}

And now our jumps should behave correctly.

OR instruction

After the jump there's a sequence of LUI/ORI/SW used to store a bunch of values in the SYS_CONTROL registers that we chose to ignore. We then stumbble upon a new instruction: 0x00000825 which encodes a bitwise or operation:

or $1, $zero, $zero

Unlike ORI which used an immediate value as a 2nd operand this one takes two register and stores the result in a third one. We can see that in this case the two source registers are $zero so it just clears $1. The implementation is fairly straightforward:

impl Cpu {
    // ...

    /// Bitwise Or
    fn op_or(&mut self, instruction: Instruction) {
        let d = instruction.d();
        let s = instruction.s();
        let t = instruction.t();

        let v = self.reg(s) | self.reg(t);

        self.set_reg(d, v);
    }
}

The next few instructions use OR to set all the general purpose registers to 0.

Type safety in the register interface

I've decided to make a modification to our Instruction interface: right now the helper methods in the Instruction return register indexes as u32. The same type as the values contained in the registers. Therefore the compiler won't warn us if we mess up and use a register index instead of a register value:

impl Cpu {
    // ...

    /// Bitwise Or
    fn op_or(&mut self, instruction: Instruction) {
        let d = instruction.d();
        let s = instruction.s();
        let t = instruction.t();

        let v = s | self.reg(t); // Oops...

        self.set_reg(d, v);
    }
}

This code is broken: instead of OR-ing the value of the register number s it ORs the index s itself. It's meaningless and obviously wrong and yet it builds without any error.

Fortunately with a small modification in our code we can have the compiler reject such code by wrapping register indexes in a new type incompatible with u32:

struct RegisterIndex(u32);

Note that this is not like a typedef in C or C++: typedef just creates an alias which remains compatible (i.e. interchangeable) with the original type. The equivalent in C would be to wrap the u32 in a struct or something like that.

Then we just have to update our helpers as well as the Cpu::reg and Cpu::set_reg methods to use a RegisterIndex instead of a plain u32.

With this modification the compiler will reject the broken op_or implementation above:

Binary operation | cannot be applied to type cpu::RegisterIndex:
     let v = s | self.reg(t);

Hurray for type safety!

CACHE_CONTROL register

The BIOS then wants to write 0x00000804 to 0xfffe0130. This address is used for cache control. Since we won't implement the caches yet we can just add a log message and ignore this register for the moment:

/// Cache control register
pub const CACHE_CONTROL: Range = Range(0xfffe0130, 4);

The coprocessors

The next unhandled instruction, 0x408c6000, involves one of the R3000 CPU coprocessors.

Coprocessors are pieces of hardware which live alongside the CPU and are accessed through dedicated instructions (instead of memory mapped I/O like external peripherals). They are used to complement and extend the capabilities of the processor. They each have their own set of registers.

The MIPS R3000 CPU can support up to 4 coprocessors:

  • The coprocessor 0 (cop0) is mandated by the MIPS architecture: it's used for exception handling. Exceptions are things like hardware interrupts and traps (divisions by zero, integer overflows, system calls etc...). We'll study them in greater details when we'll implement them.

  • The coprocessor 1 (cop1) is optional: when available it's used for floating point arithmetic. You might expect that a videogame console would benefit greatly from having hardware accelerated floating point and yet cop1 is not implemented on the playstation! Instead we have the coprocessor 2.

  • The coprocessor 2 (cop2) is, as far as I know, custom made for the Playstation. At least I can't find any reference to it outside of the Playstation hardware. It's called the "Geometry Transformation Engine", or GTE for short. It implements many instructions dealing with 3D transforms like perspective transformations, vector and matrix multiplications, color manipulation etc... It's basically the first half of the rendering pipeline, the second half being the GPU (but that one is a memory mapped peripheral, not a coprocessor).

  • The coprocessor 3 (cop3) is not implemented on the Playstation.

Hopefully we shouldn't have to mess with the GTE until we start encountering 3D code.

MTC0 instruction

Back to the 0x408c6000 instruction: the opcode (bits [31:26]) is equal to 0b010000 which means that it's an instruction for the coprocessor 0. The generic format is 0b0100nn where nn is the coprocessor number.

impl Cpu {
    //...

    fn decode_and_execute(&mut self, instruction: Instruction) {
        match instruction.function() {
        //...
            0b010000 => self.op_cop0(instruction),
            _        => panic!("Unhandled instruction {}",
                               instruction),
        }
    }

    /// Coprocessor 0 opcode
    fn op_cop0(&mut self, instruction: Instruction) {
        match instruction.cop_opcode() {
            0b00100 => self.op_mtc0(instruction),
            _       => panic!("unhandled cop0 instruction {}",
                              instruction)
        }
    }
}

Instruction::cop_opcode returns the same bit range as Instruction::s, however it returns it as a plain u32 instead of a RegisterIndex (since it's not a register in this case). You see that the current coprocessor opcode 0b00100 means MTC0 or "move to coprocessor 0". This instruction takes two parameters: the source register index (one of the CPU's general registers) and the target register (one of the coprocessor's register). Those parameters are respectively in bits [20:16] and [15:11] of the instruction.

In our current instruciton both of those parameters are equal to 12 so if we decode the instruction in full it gives7:

mtc0 $12, $cop0_12

The coprocessor register $cop0_12 is very useful: it's called the "status register" or SR for short. Among other things it's used to query and mask the exceptions and controlling the cache behaviour.

At this point the $12 register contains 0x00010000 so this MTC0 instruction sets bit 16 of SR which is the "isolate cache" bit. It makes all the following read and write target directly the cache instead of going through it towards the main memory. We're probably in the middle of the cache initialization sequence.

At any rate since we still haven't implemented anything cache-related we'll just store the value of the SR in our Cpu struct and move along:

/// CPU state
pub struct Cpu {
    //...

    /// Cop0 register 12: Status Register
    sr: u32,
}

impl Cpu {
    //...

    pub fn new(inter: Interconnect) -> Cpu {
        //...

        Cpu {
        //...
        sr: 0,
        }
    }

    fn op_mtc0(&mut self, instruction: Instruction) {
        let cpu_r = instruction.t();
        let cop_r = instruction.d().0;

        let v = self.reg(cpu_r);

        match cop_r {
            12 => self.sr = v,
            n  => panic!("Unhandled cop0 register: {:08x}", n),
        }
    }
}

Setting the SR to 0 on reset might not be accurate but I doubt it matters much.

Since the cache is supposed to be isolated all "stores" should end up in the cache and never in the main memory. Even if we don't implement the cache we don't want the BIOS to start writing at random locations in main memory when it thinks it writes to the cache so we can start by ignoring all writes when this isolation bit is set:

impl Cpu {
    //...

    /// Store Word
    fn op_sw(&mut self, instruction: Instruction) {

        if self.sr & 0x10000 != 0 {
            // Cache is isolated, ignore write
            println!("ignoring store while cache is isolated");
            return;
        }

    //...
    }
}

BNE instruction

We now encounter the instruction 0x154bfff7. It encodes a BNE or "branch if not equal" instruction. The difference between jumps and branches is that branches are conditional and they use relative offsets.

The immediate value is sign extended (in order to allow for negative offsets) and multiplied by 4 (as always, the PC must be aligned to 32bits at all times). Therefore this instruction decodes to:

bne $10, $11, -36

In other words the instruction will compare the values in $10 and $11 and if they're unequal it'll subtract 36 from the PC. If the values are equal it'll do absolutely nothing.

Like jumps, branches have a delay slot8. Fortunately our implementation in section 2.23{reference-type="ref" reference="sec:bds"} already takes care of that without any more work.

I've decided to factor the "branching" code itself in a separate function because we'll have to use the same logic in the other branch instructions:

impl Cpu {
    //...

    /// Branch to immediate value `offset`.
    fn branch(&mut self, offset: u32) {
        // Offset immediates are always shifted two places to the
        // right since `PC` addresses have to be aligned on 32bits at
        // all times.
        let offset = offset << 2;

        let mut pc = self.pc;

        pc = pc.wrapping_add(offset);

        // We need to compensate for the hardcoded
        // `pc.wrapping_add(4)` in `run_next_instruction`
        pc = pc.wrapping_sub(4);

        self.pc = pc;
    }

    /// Branch if Not Equal
    fn op_bne(&mut self, instruction: Instruction) {
        let i = instruction.imm_se();
        let s = instruction.s();
        let t = instruction.t();

        if self.reg(s) != self.reg(t) {
            self.branch(i);
        }
    }
}

Notice the wrapping_sub(4) to compensate for our pc.wrapping_add(4) in run_next_instruction. Without it we'd branch one instruction too far.

ADDI instruction

Before we even reach the target of the branch we stumble upon unhandled instruction 0x214a0080. This one is an ADDI which behaves exactly like the ADDIU instruction we've already implemented except that it generates an exception if the addition overflows.

The instruction decodes to:

addi $10, $10, 128

Since this operation checks for signed overflow I'll cast the operands to i32 before using the checked_add provided by rust's standard library9. For now I just panic if we encounter an overflow, we'll change that when we actually implement exceptions:

impl Cpu {
    //...

    /// Add Immediate Unsigned and check for overflow
    fn op_addi(&mut self, instruction: Instruction) {
        let i = instruction.imm_se() as i32;
        let t = instruction.t();
        let s = instruction.s();

        let s = self.reg(s) as i32;

        let v = match s.checked_add(i) {
            Some(v) => v as u32,
            None    => panic!("ADDI overflow"),
        };

        self.set_reg(t, v);
    }
}

The cast to i32 is important because something like 0x4 + 0xffffffff is an overflow in 32bit unsigned arithmetics. If the operands are signed however it's simply 4 + -1 and that's obviously perfectly fine. The actual result of the operation would be the same (0x00000003) but since ADDI generates an exception on overflow the difference in behaviour is critical.

Memory loads

The next unhandled instruction, 0x8d090000 is LW or "load word". It decodes to:

lw $9, 0($8)

We can reuse the load32 method to fetch the data from memory:

impl Cpu {
    //...

    /// Load Word
    fn op_lw(&mut self, instruction: Instruction) {

        if self.sr & 0x10000 != 0 {
            // Cache is isolated, ignore write
            println!("Ignoring load while cache is isolated");
            return;
        }

        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);

        let v = self.load32(addr);

        self.set_reg(t, v);
    }
}

There's a subtle problem with this implementation however.

Load delay slots

Sounds familiar? It's our friend the pipeline messing with us once again. What happens is that the load instructions attempts to read from the memory, but that takes time. At least, it takes more than a single cycle.

On the R3000 CPU it creates "load delay slots": when you load a value from memory the CPU will execute the next instruction before the value is fetched into the target register10.

Consider this sequence of instructions:

lw   $1, 0($zero)  /* Load $1 with the value at address 0 */
move $2, $1        /* Move $1 in $2 */
move $3, $1        /* Move $1 in $3 */

The first MOVE instruction11 is in the load delay slot of the previous LW. That means that at that point the register $1 does not yet contain the value loaded into it. So after these two instructions $2 contains the value of $1 before the load. The 2nd MOVE however takes place after the load delay slot so $3 will contain the final, post-load value of $1.

But it gets worse. Consider the value of $1 after these two instructions:

lw    $1, 0($zero)  /* Load $1 with the value at address 0 */
addiu $1, $zero, 42 /* Put 42 in $1 */

We first use LW to load something in $1 and then, while the load takes place, we change the value of $1 with an ADDIU instruction. Who wins?

You might think that since the LW finishes after the load delay slot its fetched value will override the one set by the ADDIU. It turns out that it's not the case however: after those two instructions $1 will contain 42, no matter what the LW fetched.

It's a bit of a bad news for us emulator writers. It means we can't execute the load before the delay slot because the instruction must see the previous value of the loaded register (otherwise the first example code above won't work) and we can't just execute it afterwards because it would make the load take the priority over the delay slot (thus breaking our 2nd example).

One way to see it is that the loaded value ends up in the target register after the next instruction has fetched the input register values but before the next instruction updates the target register values. In our first example $1 is an input register to both MOVEs while in the 2nd it's the output (destination) register of the ADDIU.

We could implement it exactly that way by splitting each instruction in two:

  • The first part would take the pre-load register values, compute the result (adding $zero and 10 in the 2nd example example above),

  • Then it would execute any pending load,

  • Finally it would store the result of the computation in the target register ($1 in the ADDIU). That way the ADDIU will write last.

I don't really like this solution however because we'll have to handle load delays explicitly in all instructions which seems inelegant and error-prone.

Instead I'm going to use two sets of general purpose registers: one will be the input set and the other the output set. Each instruction will read its input values from the former set and will write to the latter. Once the instruction is finished we copy the output set into the input set for the next instruction.

This way we can update the output register set with the load value before we execute the instruction and it will still see the old value from the input set. And if the instruction writes to the same register it will overwrite the value in the output set.

Hopefully it should be clearer in code. First let's add a 2nd set of registers and a (register, value) tuple containing the pending load:

/// CPU state
pub struct Cpu {
    //...

    /// 2nd set of registers used to emulate the load delay slot
    /// accurately. They contain the output of the current
    /// instruction.
    out_regs: [u32; 32],
    /// Load initiated by the current instruction
    load: (RegisterIndex, u32),
}

impl Cpu {

    pub fn new(inter: Interconnect) -> Cpu {
        //...

        Cpu {
            //...
        out_regs: regs,
            load: (RegisterIndex(0), 0),
        }
    }

    //...
}

If no load is pending we can just target $zero since it doesn't do anything.

Now we can update the set_reg method to target the output register set:

impl Cpu {
    //...

    fn set_reg(&mut self, index: RegisterIndex, val: u32) {
        self.out_regs[index.0 as usize] = val;

        // Make sure R0 is always 0
        self.out_regs[0] = 0;
    }
}

Since all our instructions so far use this helper method to update the register values we won't have to modify their code at all.

The next step is to update run_next_instruction to handle pending loads and copying the output registers between every instructions:

impl Cpu {
    //...

    pub fn run_next_instruction(&mut self) {
        //...

        // Execute the pending load (if any, otherwise it will load
        // $zero which is a NOP). `set_reg` works only on
        // `out_regs` so this operation won't be visible by
        // the next instruction.
        let (reg, val) = self.load;
        self.set_reg(reg, val);

        // We reset the load to target register 0 for the next
        // instruction
        self.load = (RegisterIndex(0), 0);

        self.decode_and_execute(instruction);

        // Copy the output registers as input for the
        // next instruction
        self.regs = self.out_regs;
    }
}

You can see that we're copying 128 bytes worth of registers for each instruction which might not be great performance-wise but at this point I don't really care about that.

LW instruction

We can now write the correct, load-delay friendly implementation of SW:

impl Cpu {
    //...

    /// Load Word
    fn op_lw(&mut self, instruction: Instruction) {

        if self.sr & 0x10000 != 0 {
            // Cache is isolated, ignore write
            println!("Ignoring load while cache is isolated");
            return;
        }

        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);

        let v = self.load32(addr);

        // Put the load in the delay slot
        self.load = (t, v);
    }
}

The RAM {#sec:ram}

Unfortunately we can't test our brand new load delay slot just yet because the current instruction attemps to load from an unhandled address: 0xa0000000. The memory map[tab:mmap]{reference-type="ref" reference="tab:mmap"} tells us that this is the first address in RAM.

Adding RAM support is straightforward: it's very similar to our BIOS implementation except it boots up uninitialized and it's not read-only:

/// RAM
pub struct Ram {
    /// RAM buffer
    data: Vec<u8>
}

impl Ram {

    /// Instantiate main RAM with garbage values
    pub fn new() -> Ram {

        // Default RAM contents are garbage
        let data = vec![0xca, 2 * 1024 * 1024];

        Ram { data: data }
    }

    /// Fetch the 32bit little endian word at `offset`
    pub fn load32(&self, offset: u32) -> u32 {
        let offset = offset as usize;

        let b0 = self.data[offset + 0] as u32;
        let b1 = self.data[offset + 1] as u32;
        let b2 = self.data[offset + 2] as u32;
        let b3 = self.data[offset + 3] as u32;

        b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)
    }

    /// Store the 32bit little endian word `val` into `offset`
    pub fn store32(&mut self, offset: u32, val: u32) {
        let offset = offset as usize;

        let b0 = val as u8;
        let b1 = (val >> 8) as u8;
        let b2 = (val >> 16) as u8;
        let b3 = (val >> 24) as u8;

        self.data[offset + 0] = b0;
        self.data[offset + 1] = b1;
        self.data[offset + 2] = b2;
        self.data[offset + 3] = b3;
    }
}

I arbitrarily chose 0xca as the poison value on startup. It's pretty strange that the BIOS attempts to fetch data from the RAM before writing anything to it (and effectively reading garbage) but if you look at the following instructions it repeatedly reads the same address (the first word in RAM) and does nothing with it. I'm not sure what this code does but it probably initializes something. Let's hope it's not too important...

We can then plug our brand new RAM in the interconnect as usual:

pub const RAM: Range = Range(0xa0000000, 2 * 1024 * 1024);

The coprocessor 0 registers

After that the BIOS wants to initialize the remaining cop0 registers by loading $zero into them with the MTC0 instruction.

Let's take the time to review those registers:

  • $cop0_3 is BPC, used to generate a breakpoint exception when the PC takes the given value.

  • $cop0_5 is BDA, the data breakpoint. It's like BPC except it breaks when a certain address is accessed on a data load/store instead of a PC value.

  • $cop0_6: I couldn't find a lot of informations on this register or what it does, the consensus seems to be that it's basically useless.

  • $cop0_7 is DCIC, used to enable and disable the various hardware breakpoints.

  • $cop0_9 is BDAM, it's a bitmask applied when testing for BDA above. That way we could trigger on a range of address instead of a single one.

  • $cop0_11 is BPCM, like BDAM but for masking the BPC breakpoint.

  • $cop0_12 we've already encountered: it's SR, the status register.

  • $cop0_13 is CAUSE, which contains mostly read-only data describing the cause of an exception. Apparently only bits [9:8] are writable to force an exception.

You can see that most of those registers (except SR and CAUSE) deal with hardware breakpoints. That's generally used for debugging so we shouldn't need to emulate those for most games. It's probably safe to ignore for now. You can see that the BIOS loads $zero into all of them which disables them.

For now we're just going to ignore write to these registers when the value is 0. If at some point some game writes something else we'll catch it and see what we need to implement:

impl Cpu {
    //...

    /// Move To Coprocessor 0
    fn op_mtc0(&mut self, instruction: Instruction) {
        let cpu_r = instruction.t();
        let cop_r = instruction.d().0;

        let v = self.reg(cpu_r);

        match cop_r {
            3 | 5 | 6 | 7 | 9 | 11  => // Breakpoints registers
                if v != 0 {
                    panic!("Unhandled write to cop0r{}", cop_r)
                },
            12 => self.sr = v,
            13 => // Cause register
                if v != 0 {
                    panic!("Unhandled write to CAUSE register.")
                },
            _  => panic!("Unhandled cop0 register {}", cop_r),
        }
    }
}

SLTU instruction

After that we encounter the instruction 0x0043082b which encodes the "set on less than unsigned"(STLU) opcode:

sltu $1, $2, $3

This instruction compares the value of two registers ($2 and $3 in this case) and sets the value of a third one ($1) to either 0 or 1 depending on the result of the "less than" comparison:

impl Cpu {
    //...

    /// Set on Less Than Unsigned
    fn op_sltu(&mut self, instruction: Instruction) {
        let d = instruction.d();
        let s = instruction.s();
        let t = instruction.t();

        let v = self.reg(s) < self.reg(t);

        self.set_reg(d, v as u32);
    }
}

ADDU instruction

We then stumble upon the instruction 0x03a0f021 which encodes an "Add unsigned" (ADDU) opcode:

addu $30, $29, $zero

You can see that with $zero as the third operand it simply moves $29 in $30, so in this case it's really a MOVE instruction.

The instruction is implemented like ADDIU except that we add two registers instead of a register and an immediate value:

impl Cpu {
    //...

    /// Add Unsigned
    fn op_addu(&mut self, instruction: Instruction) {
        let s = instruction.s();
        let t = instruction.t();
        let d = instruction.d();

        let v = self.reg(s).wrapping_add(self.reg(t));

        self.set_reg(d, v);
    }
}

Regions

Our next problem is an unhandled access at address 0x00000060. If we look at the memory map[tab:mmap]{reference-type="ref" reference="tab:mmap"} we see that it's the RAM. But we've already added the RAM in our interconnect in section 2.34{reference-type="ref" reference="sec:ram"}!

The problem is that currently we mapped the RAM at 0xa0000000, in the KSEG1 region. But this time the BIOS attempts to access it through an other region: KUSEG. We could add multiple mappings for each peripheral in each region but that would be a waste of code and performance.

Let's a closer look at how those regions are specified by the MIPS architecture:

  • KSEG0 starts at 0x80000000 and ends at 0x9fffffff. This region is accessed through the caches but it's not mapped through the MMU. In order to get the physical address we just have to strip the MSB.

  • KSEG1 starts at 0xa0000000 and ends at 0xbfffffff. This region is not cached or mapped through the MMU. In order to get the physical address we just have to strip the three MSBs.

  • KSEG2 starts at 0xc0000000 and ends at 0xffffffff. This region is only accessed in kernel mode and is also cached and goes through the MMU.

  • KUSEG starts at 0x00000000 and ends at 0x7fffffff. It's meant for user code and is both cached and goes through the MMU.

All that sounds rather complicated. Fortunately for us since we're targeting the Playstation and not some generic MIPS architecture we'll be able to make some simplifications:

  • The Playstation hardware does not have a MMU and therefore no virtual memory. We won't have to deal with memory translation.

  • The Playstation CPU has 1KB of data cache and an other kilobyte of instruction cache. However the data cache is not used, instead its memory is mapped as the "scratpad" at a fixed location. In other word we don't need to implement the data cache.

  • As far as I can tell the Playstation software doesn't seem to use the kernel/user privilege separation and runs everything in kernel mode.

In other words the only time we'll need to worry about which region is in use is when we'll implement the cache instruction and only for KSEG1 since that's the only non-cached region.. For everything else it doesn't matter through which region the peripherals are accessed.

In order to solve our issue of having multiple mappings at different addresses for the same peripherals in different regions we want to compute the unique physical address corresponding to a memory access and map that through our interconnect code.

By the descriptions above you see that we should mask a different number of bits depending on the region. Since KSEG2 doesn't share anything with the other regions we won't touch the address here (otherwise we would allow access to the RAM through KSEG2 for instance and that wouldn't be accurate). In order to avoid branches we can use a nice mask lookup table:

/// Mask array used to strip the region bits of the address. The
/// mask is selected using the 3 MSBs of the address so each entry
/// effectively matches 512kB of the address space. KSEG2 is not
/// touched since it doesn't share anything with the other
/// regions.
const REGION_MASK: [u32; 8] = [
    // KUSEG: 2048MB
    0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
    // KSEG0:  512MB
    0x7fffffff,
    // KSEG1:  512MB
    0x1fffffff,
    // KSEG2: 1024MB
    0xffffffff, 0xffffffff,
];

/// Mask a CPU address to remove the region bits.
pub fn mask_region(addr: u32) -> u32 {
    // Index address space in 512MB chunks
    let index = (addr >> 29) as usize;

    addr & REGION_MASK[index]
}

We can now use this mask_region function in our interconnect's load and store functions to convert any address coming from the CPU into a unique physical address used to identify the target peripheral.

We also have to change all our current address map declarations to use physical addresses:

pub const RAM: Range = Range(0x00000000, 2 * 1024 * 1024);

pub const BIOS: Range = Range(0x1fc00000, 512 * 1024);

/// Unknown registers. The name comes from mednafen.
pub const SYS_CONTROL: Range = Range(0x1f801000, 36);

/// Register that has something to do with RAM configuration,
/// configured by the BIOS
pub const RAM_SIZE: Range = Range(0x1f801060, 4);

/// Cache control register. Full address since it's in KSEG2
pub const CACHE_CONTROL: Range = Range(0xfffe0130, 4);

SH instruction

The next unhandled instruction is 0xa5200180 which encodes "store halfword" (SH). It's used to write 16bits (a halfword) to the memory:

sh $zero, 0x180($9)

The implementation is very similar to the "store word" instruction except we truncate the register to 16bits and we'll have to implement a new store16 method on our interconnect12:

impl Cpu {
    //...

    /// Store 16bit value into the memory
    fn store16(&mut self, addr: u32, val: u16) {
        self.inter.store16(addr, val);
    }

    /// Store Halfword
    fn op_sh(&mut self, instruction: Instruction) {

        if self.sr & 0x10000 != 0 {
            // Cache is isolated, ignore write
            println!("Ignoring store while cache is isolated");
            return;
        }

        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);
        let v    = self.reg(t);

        self.store16(addr, v as u16);
    }
}

And in the interconnect:

impl Interconnect {
    //...

    /// Store 16bit halfword `val` into `addr`
    pub fn store16(&mut self, addr: u32, val: u16) {

        if addr % 2 != 0 {
            panic!("Unaligned store16 address: {:08x}", addr);
        }

        panic!("unhandled store16 into address {:08x}", addr);
    }
}

I start with an empty function instead of copying the store32 code because different devices react differently when we change the transaction width. Some will pad the value to 32bits with zeroes, others may just set 16bits in the register and leave the others untouched. For this reason I'll be conservative and add them only when needed.

SPU registers

If we run that code we see that this store16 attempts to store 0 at 0x1f801d80. Looking at the memory map we see it's the address of a sound processing unit (SPU) hardware register. At that point we don't really care for sound so we're going to ignore writes to these addresses for the time being:

impl Interconnect {
    //...

    /// Store 16bit halfword `val` into `addr`
    pub fn store16(&mut self, addr: u32, _: u16) {

        if addr % 2 != 0 {
            panic!("Unaligned store16 address: {:08x}", addr);
        }

        let abs_addr = map::mask_region(addr);

        if let Some(offset) = map::SPU.contains(abs_addr) {
            println!("Unhandled write to SPU register {:x}", offset);
            return;
        }

        panic!("unhandled store16 into address {:08x}", addr);
    }
}

/// SPU registers
pub const SPU: Range = Range(0x1f801c00, 640);

JAL instruction

The next unhandled instruction should be 0x0ff00698 which is a "jump and link" (JAL). It behaves like the regular jump instruction except that it also stores the return address in $ra ($31):

jal 0xfc01a60

Using this instruction it's easy to implement function calls: the instruction is called with JAL and can return to the caller by jumping to the value in $ra. Then the control returns to the calling function. The $ra register is the link between the caller and the callee.

We can reuse the regular J opcode implementation and simply add the code to store the return value in $31:

impl Cpu {
    //...

    /// Jump And Link
    fn op_jal(&mut self, instruction: Instruction) {
        let ra = self.pc;

        // Store return address in $31 ($ra)
        self.set_reg(RegisterIndex(31), ra);

        self.op_j(instruction);
    }
}

ANDI instruction

We continue with instruction 0x308400ff which is a "bitwise and immediate" (ANDI):

andi $4, $4, 0xff

We can simply copy the implementation of ORI and replace the | with an &:

impl Cpu {
    //...

    /// Bitwise And Immediate
    fn op_andi(&mut self, instruction: Instruction) {
        let i = instruction.imm();
        let t = instruction.t();
        let s = instruction.s();

        let v = self.reg(s) & i;

        self.set_reg(t, v);
    }
}

SB instruction

After the word and halfword store instructions we now meet 0xa1c42041 which is a "store byte" (SB) instruction. We have to implement a third path for accessing the memory like we did for store32 and store32:

impl Cpu {
    //...

    /// Store 16bit value into the memory
    fn store8(&mut self, addr: u32, val: u8) {
        self.inter.store8(addr, val);
    }

    /// Store Byte
    fn op_sb(&mut self, instruction: Instruction) {

        if self.sr & 0x10000 != 0 {
            // Cache is isolated, ignore write
            println!("Ignoring store while cache is isolated");
            return;
        }

        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);
        let v    = self.reg(t);

        self.store8(addr, v as u8);
    }
}

Expansion 2

The address being written to is 0x1f802041 which falls in the expansion 2 memory map. As far as I can tell this expansion is only used for debugging on development boards and doesn't do anything useful on real hardware. Therefore we'll just ignore writes to this expansion:

impl Interconnect {
    //...

    /// Store byte `val` into `addr`
    pub fn store8(&mut self, addr: u32, _: u8) {
        let abs_addr = map::mask_region(addr);

        if let Some(offset) = map::EXPANSION_2.contains(abs_addr) {
            println!("Unhandled write to expansion 2 register {:x}", offset);
            return;
        }

        panic!("unhandled store8 into address {:08x}", addr);
    }
}

/// Expansion region 2
pub const EXPANSION_2: Range = Range(0x1f802000, 66);

JR instruction

A few steps later we encounter 0x03e00008 which is the "jump register" (JR) instruction. It simply sets the PC to the value stored in one of the general purpose registers:

jr $31

Since JAL stores the return address in $31 we can return from a subroutine by calling jr $ra which is exactly what the BIOS is doing here.

impl Cpu {
    //...

    /// Jump Register
    fn op_jr(&mut self, instruction: Instruction) {
        let s = instruction.s();

        self.pc = self.reg(s);
    }
}

LB instruction

The next unhandled instruction is 0x81efe288 which encodes "load byte" (LB). As you can guess it's like LW except that it only loads 8bits from the memory13:

lb $15, -7544($15)

Since the general purpose registers are always 32bit LB only loads the low 8bits of the register. The byte is treated like a signed value so it's sign extended to the full 32bits. Of course like LW there's a load delay of one instruction. We can implement it like this14:

impl Cpu {
    //...

    /// Load 8bit value from the memory
    fn load8(&self, addr: u32) -> u8 {
        self.inter.load8(addr)
    }

    /// Load Byte (signed)
    fn op_lb(&mut self, instruction: Instruction) {

        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);

        // Cast as i8 to force sign extension
        let v = self.load8(addr) as i8;

        // Put the load in the delay slot
        self.load = (t, v as u32);
    }
}

Next is the Interconnect implementation. The current instruction attempts to load from an address within the BIOS so we'll add support for it:

impl Interconnect {
    //...

    /// Load byte at `addr`
    pub fn load8(&self, addr: u32) -> u8 {
        let abs_addr = map::mask_region(addr);

        if let Some(offset) = map::BIOS.contains(abs_addr) {
            return self.bios.load8(offset);
        }

        panic!("unhandled load8 at address {:08x}", addr);
    }
}

And the implementation of load8 in the BIOS:

impl Bios {
    //...

    /// Fetch byte at `offset`
    pub fn load8(&self, offset: u32) -> u8 {
        self.data[offset as usize]
    }
}

BEQ instruction

We then get a new branch instruction: 0x11e0000c is "branch if equal" (BEQ):

beq $15, $zero, +48

We can reuse the code of BNE by changing the condition:

impl Cpu {
    //...

    /// Branch if Equal
    fn op_beq(&mut self, instruction: Instruction) {
        let i = instruction.imm_se();
        let s = instruction.s();
        let t = instruction.t();

        if self.reg(s) == self.reg(t) {
            self.branch(i);
        }
    }
}

Expansion 1

After that the BIOS attemps to read a byte at 0x1f000084. This is where the first expansion port is mapped. This expansion goes to the parallel port on the back of the early Playstation models.

If you look at the byte read by the first LB instruction above you'll see it's the first byte in a C-string: "Licensed by Sony Computer Entertainment Inc". Apparently in order to detect and validate the expansion the BIOS compares this hardcoded string with the values stored starting at offset 0x84 in the expansion.

We don't really have any reason to implement an expansion at that point so we'll return the default value when no expansion is present. Looking at mednafen's source code it seems to be full-ones15:

impl Interconnect {
    //...

    /// Load byte at `addr`
    pub fn load8(&self, addr: u32) -> u8 {
        let abs_addr = map::mask_region(addr);

        if let Some(offset) = map::BIOS.contains(abs_addr) {
            return self.bios.load8(offset);
        }

        if let Some(_) = map::EXPANSION_1.contains(abs_addr) {
            // No expansion implemented
            return 0xff;
        }

        panic!("unhandled load8 at address {:08x}", addr);
    }
}

RAM byte access

Now the BIOS wants to store a byte to the RAM but we haven't implemented that yet, let's fix that by implementing store8 and let's add load8 while we're at it:

impl Interconnect {
    //...

    /// Store byte `val` into `addr`
    pub fn store8(&mut self, addr: u32, val: u8) {
        let abs_addr = map::mask_region(addr);

        if let Some(offset) = map::RAM.contains(abs_addr) {
            return self.ram.store8(offset, val);
        }

    //...
    }

    /// Load byte at `addr`
    pub fn load8(&self, addr: u32) -> u8 {
        let abs_addr = map::mask_region(addr);

        if let Some(offset) = map::RAM.contains(abs_addr) {
            return self.ram.load8(offset);
        }

        //....
    }
}

And then in the RAM implementation:

impl Ram {
    //...

    /// Store the byte `val` into `offset`
    pub fn store8(&mut self, offset: u32, val: u8) {
        self.data[offset as usize] = val;
    }

    /// Fetch the byte at `offset`
    pub fn load8(&self, offset: u32) -> u8 {
        self.data[offset as usize]
    }
}

MFC0 instruction

We've already met MTC0, now we encounter the reciprocal instruction: 0x40026000 encodes "move from coprocessor 0" (MFC0)16:

mfc0 $2, $cop0_12

There's one important thing to note however: MFC instructions behave like memory loads and have a delay slot before the value is finally stored in the target register.

Fortunately we can simply re-use our load delay slots infrastructure:

impl Cpu {
    //...

    /// Move From Coprocessor 0
    fn op_mfc0(&mut self, instruction: Instruction) {
        let cpu_r = instruction.t();
        let cop_r = instruction.d().0;

        let v = match cop_r {
            12 => self.sr,
            13 => // Cause register
                panic!("Unhandled read from CAUSE register"),
            _  =>
                panic!("Unhandled read from cop0r{}", cop_r),
        };

        self.load = (cpu_r, v)
    }
}

AND instruction

An other easy instruction follows a few cycles later: 0x00412024 which is a "bitwise and" (AND):

and $4, $2, $1

We've already implemented OR so we can reuse the code, only changing the operator:

impl Cpu {
    //...

    /// Bitwise And
    fn op_and(&mut self, instruction: Instruction) {
        let d = instruction.d();
        let s = instruction.s();
        let t = instruction.t();

        let v = self.reg(s) & self.reg(t);

        self.set_reg(d, v);
    }
}

ADD instruction

We already implemented ADDIU, ADDI and ADDU. We finally encounter "add" (ADD) in instruction 0x01094020:

add $8, $8, $9

It adds the value of two registers (like ADDU) but generates an exception on signed overflow (like ADDI):

impl Cpu {
    //...

    /// Add and generate an exception on overflow
    fn op_add(&mut self, instruction: Instruction) {
        let s = instruction.s();
        let t = instruction.t();
        let d = instruction.d();

        let s = self.reg(s) as i32;
        let t = self.reg(t) as i32;

        let v = match s.checked_add(t) {
            Some(v) => v as u32,
            None    => panic!("ADD overflow"),
        };

        self.set_reg(d, v);
    }
}

Interrupt Control registers

The BIOS then attempts to write 0 at address 0x1f801074. Looking at the memory map this is the "Interrupt Mask" register.

This register is used to activate or ignore external interrupt signals (things like blanking interrupts from the GPU, timers, controller and memory card interrupts etc...).

Interrupts are a signal coming from the peripherals to the CPU to notify it that a certain event occurred (a timer reached its target value, a button was pressed on the controller etc...). This way the CPU doesn't have to waste time polling the status of the various peripherals, it can just wait for the interrupt notification.

Writing 0 to this register masks all interrupts so it seems that the BIOS wants to make sure it won't get interrupted before proceeding further.

There's an other interrupt control register right before that one at 0x1f801070. That one is called "Interrupt Status" and is used to query the status of the various interrupts (active or not).

Since we don't have any peripheral yet it wouldn't make sense to implement interrupts at that point, we're going to ignore writes to these addresses for now17:

impl Interconnect {
    //...

    /// Store 32bit word `val` into `addr`
    pub fn store32(&mut self, addr: u32, val: u32) {
        //...

        if let Some(offset) = map::IRQ_CONTROL.contains(abs_addr) {
            println!("IRQ control: {:x} <- {:08x}", offset, val);
            return;
        }

        panic!("unhandled store32 into address {:08x}", addr);
    }
}

/// Interrupt Control registers (status and mask)
pub const IRQ_CONTROL: Range = Range(0x1f801070, 8);

BGTZ instruction

The next unhandled instruction is 0x1ca00003 which is a "branch if greater than zero" (BGTZ):

bgtz $5, +12

It's similar to the BEQ and BNE we've already encountered but instead of comparing two registers it compares a single general purpose register to 0.

The comparison is done using signed integers. For unsigned integers the test would only ever be false if the register contained 0 and we can already test that with BNE:

bne $5, $zero, +12

So we have to be careful to cast to a signed integer before the comparison in our implementation:

impl Cpu {
    //...

    /// Branch if Greater Than Zero
    fn op_bgtz(&mut self, instruction: Instruction) {
        let i = instruction.imm_se();
        let s = instruction.s();

        let v = self.reg(s) as i32;

        if v > 0 {
            self.branch(i);
        }
    }
}

BLEZ instruction

A few step later we encounter the complementary instruction 0x18a00005 which encodes "branch if less than or equal to zero" (BLEZ):

blez $5, +20

It's the same thing as BGTZ with the opposite predicate:

impl Cpu {
    //...

    /// Branch if Less than or Equal to Zero
    fn op_blez(&mut self, instruction: Instruction) {
        let i = instruction.imm_se();
        let s = instruction.s();

        let v = self.reg(s) as i32;

        if v <= 0 {
            self.branch(i);
        }
    }
}

LBU instruction

After that we meet instruction 0x90ae0000 which is a "load byte unsigned" (LBU):

lbu $14, 0($5)

It's exactly like LB but without sign extension, the high 24 bits of the target register are set to 0:

impl Cpu {
    //...

    /// Load Byte Unsigned
    fn op_lbu(&mut self, instruction: Instruction) {

        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);

        let v = self.load8(addr);

        // Put the load in the delay slot
        self.load = (t, v as u32);
    }
}

JALR instruction

Then we encounter instruction 0x0100f809 which encodes a "jump and link register" (JALR):

jalr $31, $8

It's implemented like JR except that it also stores the return address in a general purpose register. Unlike JAL, JALR can store the return address in any general purpose register, not just $ra:

impl Cpu {
    //...

    /// Jump And Link Register
    fn op_jalr(&mut self, instruction: Instruction) {
        let d = instruction.d();
        let s = instruction.s();

        let ra = self.pc;

        // Store return address in `d`
        self.set_reg(d, ra);

        self.pc = self.reg(s);
    }
}

BLTZ, BLTZAL, BGEZ and BGEZAL instructions

The next unhandled instruction, 0x04800003, is a bit of a weird one: the six MSBs are 0b000001 which can encode four different instructions:

  • "branch if less than zero" (BLTZ):

    bltz   $4, +12
    
  • "branch if less than zero and link" (BLTZAL):

    bltzal $4, +12
    
  • "branch if greater than or equal to zero" (BGEZ):

    bgez   $4, +12
    
  • "branch if greater than or equal to zero and link" (BGEZAL):

    bgezal $4, +12
    

In order to figure out what to do exactly we need to look at bits 16 and 20 in the instruction:

  • If bit 16 is set then the instruction is BGEZ, otherwise it's BLTZ.

  • If bits [20:17] are equal to 0x80 then the return address is linked in $ra.

Note that when linking is requested the return address is linked in $ra even if the branch is not taken.

Here's how it can be implemented:

impl Cpu {
    //...

    /// Various branch instructions: BGEZ, BLTZ, BGEZAL, BLTZAL.
    /// Bits 16 and 20 are used to figure out which one to use.
    fn op_bxx(&mut self, instruction: Instruction) {
        let i = instruction.imm_se();
        let s = instruction.s();

        let instruction = instruction.0;

        let is_bgez = (instruction >> 16) & 1;
        let is_link = (instruction >> 17) & 0xf == 8;

        let v = self.reg(s) as i32;

        // Test "less than zero"
        let test = (v < 0) as u32;

        // If the test is "greater than or equal to zero" we need
        // to negate the comparison above since
        // ("a >= 0" <=> "!(a < 0)"). The xor takes care of that.
        let test = test ^ is_bgez;

        if is_link {
            let ra = self.pc;

            // Store return address in R31
            self.set_reg(RegisterIndex(31), ra);
        }

        if test != 0 {
            self.branch(i);
        }
    }
}

Instead of testing bit 16 directly I save a branch by xoring the value of test (which is a boolean 0 or 1) with it.

SLTI instruction

We then encounter 0x28810010 which encodes instruction "set if less than immediate" (SLTI):

slti $1, $4, 16

It works like SLTU except that it compares a register with an immediate value (sign-extended) and the comparison is done using signed arithmetics:

impl Cpu {
    //...

    /// Set if Less Than Immediate (signed)
    fn op_slti(&mut self, instruction: Instruction) {
        let i = instruction.imm_se() as i32;
        let s = instruction.s();
        let t = instruction.t();

        let v = (self.reg(s) as i32) < i;

        self.set_reg(t, v as u32);
    }
}

SUBU instruction

The next unhandled instruction is 0x01c47023 which encodes "substract unsigned" (SUBU):

subu $14, $14, $4

The implementation is straightforward:

impl Cpu {
    //...

    /// Substract Unsigned
    fn op_subu(&mut self, instruction: Instruction) {
        let s = instruction.s();
        let t = instruction.t();
        let d = instruction.d();

        let v = self.reg(s).wrapping_sub(self.reg(t));

        self.set_reg(d, v);
    }
}

SRA instruction

Next we meet instruction 0x00042603 which is "shift right arithmetic" (SRA):

sra $4, $4, 24

There are two versions of the shift right instruction: arithmetic and logical. The arithmetic version considers that the value is signed and use the sign bit to fill the missing MSBs in the register after the shift.

In Rust, C and C++ we can achieve the same behavior by casting the register value to a signed integer before doing the shift:

impl Cpu {
    //...

    /// Shift Right Arithmetic
    fn op_sra(&mut self, instruction: Instruction) {
        let i = instruction.shift();
        let t = instruction.t();
        let d = instruction.d();

        let v = (self.reg(t) as i32) >> i;

        self.set_reg(d, v as u32);
    }
}

DIV instruction

The next unhandled instruction is 0x0061001a which is "divide" (DIV):

div $3, $1

Multiplications and divisions are a bit peculiar on the MIPS architecture: for one, the result is not stored in general purpose registers but in two dedicated 32bit registers: HI and LO.

For a division LO will contain the quotient and HI the remainder of the euclidean division.

The reason for this is that divisions and multiplications are typically much slower than the other instructions we've implemented so far (with the exception of loads and stores potentially, due to the memory latency). While a simple ADD or SRA can be executed in a single CPU cycle, DIV can take as much as 36 cycles to get the result.

In order to try and hide this delay when the CPU executes a division instruction it does not stall the pipeline waiting for the instruction to finish. Rather it continues executing the following instructions and when the code decides to fetch the result of the division (using dedicated instructions to load HI or LO) the CPU only stalls if it didn't have the time to finish doing the division in the background. This way if you craft your assembly cleverly you can hide the division delay by doing some other work while the division is finishing.

For now we haven't bothered implementing accurate timings at all so we won't worry about these details and consider the division takes one cycle to execute. Later on when we implement proper timings we'll have to revisit that code.

Numerator Denominator Quotient (LO) Remainder (HI)


$\ge 0$ 0 -1 (0xffffffff) numerator $&lt; 0$ 0 +1 numerator 0x80000000 0xffffffff 0x80000000 0

: Special cases in divisions[]{label="tab:divspecial"}

An important thing to consider is what happens when we encounter a division by zero. Perhaps surprisingly the CPU does not generate an exception, it just gives bogus values (1 or -1 depending on the sign of the dividend).

An other bogus behaviour would be to divide 0x80000000 (-2147483648) by 0xffffffff (-1) which would yield 2147483648 which does not fit in a 32bit signed integer. Table [tab:divspecial]{reference-type="ref" reference="tab:divspecial"} gives a summary of those special cases.

We should now have all we need to implement the instruction, let's start by adding the HI and LO registers to our Cpu:

/// CPU state
pub struct Cpu {
    //...

    /// HI register for division remainder and multiplication high
    /// result
    hi: u32,
    /// LO register for division quotient and multiplication low
    /// result
    lo: u32,
}

impl Cpu {

    pub fn new(inter: Interconnect) -> Cpu {
        //...

        Cpu {
        //...
            hi: 0xdeadbeef,
            lo: 0xdeadbeef,
        }
    }

    //...
}

And now the implementation of the DIV opcode itself:

impl Cpu {
    //...

    /// Divide (signed)
    fn op_div(&mut self, instruction: Instruction) {
        let s = instruction.s();
        let t = instruction.t();

        let n = self.reg(s) as i32;
        let d = self.reg(t) as i32;

        if d == 0 {
            // Division by zero, results are bogus
            self.hi = n as u32;

            if n >= 0 {
                self.lo = 0xffffffff;
            } else {
                self.lo = 1;
            }
        } else if n as u32 == 0x80000000 && d == -1 {
            // Result is not representable in a 32bit
            // signed integer
            self.hi = 0;
            self.lo = 0x80000000;
        } else {
            self.hi = (n % d) as u32;
            self.lo = (n / d) as u32;
        }
    }
}

MFLO instruction

We've seen that divisions store their results in the HI and LO registers but we don't know how we access those yet. Unsurprisingly the next unhandled instruction does just that: 0x00001812 encodes "move from LO" (MFLO):

mflo $3

This instruction simply moves the contents of LO in a general purpose register. This instruction would also stall if the division was not yet done but we'll implement that later:

impl Cpu {
    //...

    /// Move From LO
    fn op_mflo(&mut self, instruction: Instruction) {
        let d = instruction.d();

        let lo = self.lo;

        self.set_reg(d, lo);
    }
}

SRL instruction

We've implemented SRA not long ago, now we encounter the sister instruction 0x00057082 which is a "shift right logical" (SRL):

srl $14, $5, 2

It's very similiar to SRA except that the instruction treats the value as unsigned and fills the missing MSBs with 0 after the shift. In Rust, C and C++ we can achieve this behavior by shifting unsigned values:

impl Cpu {
    //...

    /// Shift Right Logical
    fn op_srl(&mut self, instruction: Instruction) {
        let i = instruction.shift();
        let t = instruction.t();
        let d = instruction.d();

        let v = self.reg(t) >> i;

        self.set_reg(d, v);
    }
}

SLTIU instruction

After that we meet 0x2c410045 which is "set if less than immediate unsigned" (SLTI):

sltiu $1, $2, 0x45

It's implemented like SLTI but using unsigned integers18:

impl Cpu {
    //...

    /// Set if Less Than Immediate Unsigned
    fn op_sltiu(&mut self, instruction: Instruction) {
        let i = instruction.imm_se();
        let s = instruction.s();
        let t = instruction.t();

        let v = self.reg(s) < i;

        self.set_reg(t, v as u32);
    }
}

DIVU instruction

Now we encounter the other division instruction: 0x0064001b which encodes "divide unsigned" (DIVU):

divu $3, $4

Since this version uses unsigned operands we only have one special case: the division by zero (the first line in table [tab:divspecial]{reference-type="ref" reference="tab:divspecial"}). Thus the implementation is slightly shorter than DIV:

impl Cpu {
    //...

    /// Divide Unsigned
    fn op_divu(&mut self, instruction: Instruction) {
        let s = instruction.s();
        let t = instruction.t();

        let n = self.reg(s);
        let d = self.reg(t);

        if d == 0 {
            // Division by zero, results are bogus
            self.hi = n;
            self.lo = 0xffffffff;
        } else {
            self.hi = n % d;
            self.lo = n / d;
        }
    }
}

MFHI instruction

We already implemented MFLO, now we meet instruction 0x0000c810 which encodes "move from HI" (MFHI):

mfhi $25

Like MFLO it should be able to stall if the operation has not yet finished but we'll implement that later:

impl Cpu {
    //...

    /// Move From HI
    fn op_mflo(&mut self, instruction: Instruction) {
        let d = instruction.d();

        let hi = self.hi;

        self.set_reg(d, hi);
    }
}

SLT instruction

The next unhandled instruction is 0x0338082a which is "set on less than":

slt $1, $25, $24

It's like SLTU but with signed operands:

impl Cpu {
    //...

    /// Set on Less Than (signed)
    fn op_slt(&mut self, instruction: Instruction) {
        let d = instruction.d();
        let s = instruction.s();
        let t = instruction.t();

        let s = self.reg(s) as i32;
        let t = self.reg(t) as i32;

        let v = s < t;

        self.set_reg(d, v as u32);
    }
}

Interrupt Control read

The BIOS then attempts to read from the Interrupt Mask register. Earlier we just ignored writes to this register (and the Interrupt Status register) so for now we'll return 0. We'll rewrite this code when we decide to implement interrupts:

impl Interconnect {
    //...

    /// Load 32bit word at `addr`
    pub fn load32(&self, addr: u32) -> u32 {
        //...

        if let Some(offset) = map::IRQ_CONTROL.contains(abs_addr) {
            println!("IRQ control read {:x}", offset);
            return 0;
        }

    panic!("unhandled load32 at address {:08x}", addr);
    }
}

Timer registers

After that the BIOS wants to write 0 to 0x1f801104. This address is one of the timer registers. Timers are basically just configurable counters that can generate interrupts at a predetermined rate. There are three independent timers on the Playstation.

For now we won't have to actually implement them though because the BIOS just initializes them to a default disabled state by writing 0 to all the configuration registers. We can just ignore those writes and move along:

impl Interconnect {
    //...

    /// Store 16bit halfword `val` into `addr`
    pub fn store16(&mut self, addr: u32, _: u16) {
        //...

        if let Some(offset) = map::TIMERS.contains(abs_addr) {
            println!("Unhandled write to timer register {:x}",
                     offset);
            return;
        }

        panic!("unhandled store16 into address {:08x}", addr);
    }
}

Exceptions

The next unhandled instruction is 0x0000000c which encodes a "system call" (SYSCALL):

syscall 0

This instruction is used to explicitly trigger an exception. Exceptions occur when peripherals trigger an (unmasked) interrupt, when certain error occurs (unaligned memory access, checked overflow in certain instructions, etc...) or with commands which are meant to trigger an exception like SYSCALL or BREAK.

When an exception occurs the following takes place in the CPU:

  • The current value of the PC is stored in $cop0_14, the EPC (Exception PC) register19,

  • Record the cause of the exception (syscall, overflow, interrupt...) in $cop0_13, the CAUSE register,

  • Disable interrupts in $cop0_12 (SR),

  • Jump into the exception handler whose address is either 0x80000080 or 0xbfc00180 depending on the value of the BEV field (bit 22) in $cop0_12 (SR).

Unlike regular jumps and branches exceptions don't have a branch delay slot: the CPU jumps to the exception handler right after the current instruction.

The problem is that with my current architecture we fetch an instruction ahead of time to emulate the branch delay slot. When an exception is triggered we'd have to replace that instruction by the first one in the exception handler. It's possible of course but it's a bit messy and I think it was a bad idea after all.

Instead I'm going to use two variables for the PC: one will hold he current instruction and one will hold the "next PC". Normally next_pc is always 4 bytes ahead but when a branch occurs we'll set the PC to the instruction in the delay slot and next_pc to the branch target. In case of an exception however we'll set the PC to the exception handler address directly.

Let's change our CPU state to reflect that change:

/// CPU state
pub struct Cpu {
    /// The program counter register: points to the
    /// next instruction
    pc: u32,
    /// Next value for the PC, used to simulate the
    /// branch delay slot
    next_pc: u32,
    //...
}

impl Cpu {

    pub fn new(inter: Interconnect) -> Cpu {
        //...

        // Reset value for the PC, beginning of BIOS memory
        let pc = 0xbfc00000;

        Cpu {
            pc: pc,
            next_pc: pc.wrapping_add(4),
        //...
        }
    }

    //...
}

We can then (once again) rework run_next_instruction to use our PC pair:

impl Cpu {
    //...

    pub fn run_next_instruction(&mut self) {
        let pc = self.pc;

        // Fetch instruction at PC
        let instruction = Instruction(self.load32(pc));

        // Increment next PC to point to the next instruction.
        self.pc = self.next_pc;
        self.next_pc = self.next_pc.wrapping_add(4);

        // Execute the pending load (if any, otherwise it will load
        // `R0` which is a NOP). `set_reg` works only on `out_regs`
        // so this operation won't be visible by the next
        // instruction.
        let (reg, val) = self.load;
        self.set_reg(reg, val);

        // We reset the load to target register 0 for the next
        // instruction
        self.load = (RegisterIndex(0), 0);

        self.decode_and_execute(instruction);

        // Copy the output registers as input for the next instruction
        self.regs = self.out_regs;
    }
}

Then we just need to modify our branch and jump functions to set next_pc instead of pc to set the target address.

After that we can implement our exception infrastructure. On top of pc and next_pc we'll also need to store the address of the current instruction to store it in the EPC register ($cop0_14). We also need to add the CAUSE register to store the exception code:

/// CPU state
pub struct Cpu {
    //...

    /// Address of the instruction currently being executed. Used for
    /// setting the EPC in exceptions.
    current_pc: u32,
    /// Cop0 register 13: Cause Register
    cause: u32,
    /// Cop0 register 14: EPC
    epc: u32,
}

impl Cpu {
    //...

    pub fn run_next_instruction(&mut self) {
        // Fetch instruction at PC
        let instruction = Instruction(self.load32(self.pc));

        // Save the address of the current instruction to save in
        // `EPC` in case of an exception.
        self.current_pc = self.pc;

        //...
    }
}

Now that we've added the EPC and CAUSE registers for cop0 we can also add them to our implementation of MFC0:

impl Cpu {
    //...

    /// Move From Coprocessor 0
    fn op_mfc0(&mut self, instruction: Instruction) {
        let cpu_r = instruction.t();
        let cop_r = instruction.d().0;

        let v = match cop_r {
            12 => self.sr,
            13 => self.cause,
            14 => self.epc,
            _  =>
                panic!("Unhandled read from cop0r{}", cop_r),
        };

        self.load = (cpu_r, v)
    }
}

SYSCALL instruction

Finally we can implement our exception infrastructure and our SYSCALL opcode. I'm going to use a exception method that will be used from various exception sources:

impl Cpu {
    //...

    /// Trigger an exception
    fn exception(&mut self, cause: Exception) {
        // Exception handler address depends on the `BEV` bit:
        let handler = match self.sr & (1 << 22) != 0 {
            true  => 0xbfc00180,
            false => 0x80000080,
        };

        // Shift bits [5:0] of `SR` two places to the left.
        // Those bits are three pairs of Interrupt Enable/User
        // Mode bits behaving like a stack 3 entries deep.
        // Entering an exception pushes a pair of zeroes
        // by left shifting the stack which disables
        // interrupts and puts the CPU in kernel mode.
        // The original third entry is discarded (it's up
        // to the kernel to handle more than two recursive
        // exception levels).
        let mode = self.sr & 0x3f;
        self.sr &= ~0x3f;
        self.sr |= (mode << 2) & 0x3f;

        // Update `CAUSE` register with the exception code (bits
        // [6:2])
        self.cause = (cause as u32) << 2;

        // Save current instruction address in `EPC`
        self.epc = self.current_pc;

        // Exceptions don't have a branch delay, we jump directly
        // into the handler
        self.pc      = handler;
        self.next_pc = self.pc.wrapping_add(4);
    }

    /// System Call
    fn op_syscall(&mut self, _: Instruction) {
        self.exception(Exception::SysCall);
    }
}

/// Exception types (as stored in the `CAUSE` register)
enum Exception {
    /// System call (caused by the SYSCALL opcode)
    SysCall = 0x8,
}

Our op_syscall method ends up being a one liner. All the logic is in the generic exception method.

With this SYSCALL instruction the BIOS enters the exception handler. The NoCash specs tell us that we have to look at the contents of register $4 to know what the BIOS is supposed to do. In this case $4 contains 1 so it's supposed to run "EnterCriticalSection". This function is apparently supposed to disable all interrupts. Once this is done if everything works well the exception handler should return to the caller using an RFE instruction, let's continue and see if we find it as expected.

MTLO instruction

In the exception handler we stumble upon 0x00400013 which is "move to LO" (MTLO):

mtlo $2

As its name implies it just moves the value from a general purpose register into the LO register. Be careful though because the instruction encoding is different from MFLO:

impl Cpu {
    //...

    /// Move to LO
    fn op_mtlo(&mut self, instruction: Instruction) {
        let s = instruction.s();

        self.lo = self.reg(s);
    }
}

It might seem surprising to encounter this instruction: why would the BIOS want to move something into the LO register? After all this register is for the result of divisions and multiplications, you can't do anything with it besides reading it back.

The answer is that exception handlers are not supposed to restore all register values before returning to the "normal" code flow. The reason is obvious: exceptions can be triggered by asynchronous interrupts so they can basically happen at any time. If the exception handler changes the value of any register before giving back the control to the interrupted code it could lead to bogus behaviour.

For instance some game code could start a division and be interrupted before it reads the result in LO. Then the interrupt handler needs to compute an other division but does not restore the original value of the register before returning the control to the game. At that point the game reads LO expecting to get the result of its computation but instead it gets some garbage value left there by the handler. Obviously that would be problematic.

To avoid this the prologue of the exception handler saves the value of the registers it might modify (including HI and LO) to the RAM and then loads them back in the epilogue.

There are two exceptions though: registers $26 and $27' are reserved for the BIOS and are not preserved by the exception handler. In other words no code should use those registers when exceptions can occur because their content could change at any moment.

MTHI instruction

Unsurprisingly the MTLO is almost immediately followed by instruction 0x00400011 which is "move to HI" (MTHI):

mtlo $2

The implementation is almost identical to MTLO:

impl Cpu {
    //...

    /// Move to HI
    fn op_mthi(&mut self, instruction: Instruction) {
        let s = instruction.s();

        self.hi = self.reg(s);
    }
}

RFE intsruction

As expected once the exception handler is done it executes instruction 0x42000010 which is a coprocessor 0 opcode for "return from exception" (RFE):

rfe

All this instruction does is shift the Interrupt Enable/User Mode bits two places back to the right. This effectively undoes the opposite shift done when entering the handler and therefore puts the CPU back in the mode it was when the exception triggered (unless SR itself has been modified in the handler).

It does not reset the PC however, it's up to the BIOS to fetch the address in EPC, increment it by 4 to point at the next instruction and jump to it. The RFE instruction is typically in the final jump delay slot (and that's exactly what the Playstation BIOS handler does in this case).

The instruction encoding for RFE is a bit annoying: as usual we begin by checking bits [31:26] which are 0b010000 and introduce a coprocessor opcode. Then we check bits [25:21] to figure which one it is. For RFE it's 0b10000.

But it's not over! There can be multiple instructionts with this coprocessor encoding, although RFE is the only one implemented on the Playstation hardware (the others have to do with virtual memory). To make sure the requested instruction is the one we expect we must check bits [5:0] which must be equal to 0b010000:

impl Cpu {
    //...

    /// Coprocessor 0 opcode
    fn op_cop0(&mut self, instruction: Instruction) {
        match instruction.cop_opcode() {
            0b00000 => self.op_mfc0(instruction),
            0b00100 => self.op_mtc0(instruction),
            0b10000 => self.op_rfe(instruction),
            _       => panic!("unhandled cop0 instruction {}", instruction)
        }
    }

    /// Return From Exception
    fn op_rfe(&mut self, instruction: Instruction) {
        // There are other instructions with the same encoding but all
        // are virtual memory related and the Playstation doesn't
        // implement them. Still, let's make sure we're not running
        // buggy code.
        if instruction.0 & 0x3f != 0b010000 {
            panic!("Invalid cop0 instruction: {}", instruction);
        }

        // Restore the pre-exception mode by shifting the Interrupt
        // Enable/User Mode stack back to its original position.
        let mode = self.sr & 0x3f;
        self.sr &= !0x3f;
        self.sr |= mode >> 2;
    }
}

Exceptions and branch delay slots

In our current implementation when an exception occurs we store the current instruction's address in 'EPC'. That's correct in most cases but there's one exception in the MIPS archicture: when an exception occurs in a branch delay slot we must store the address of the branch instruction in EPC20.

Consider the following sequence where we have a 'SYSCALL' instruction in a 'JR' delay slot:

    jr $ra
    syscall

In this case the CPU will put the address of the jr $ra instruction in EPC before entering the exception handler. In order to signal this condition to the handler the CPU also sets bit 31 of the CAUSE register.

In order to implement this behaviour we first need to keep track of whether or we're in a branch delay slot. It's tempting to just check whether or not the next instruction is 4 bytes ahead of the current one but it's technically possible to branch 4 bytes ahead, even though it wouldn't be very useful. Instead I'm going to play it safe and add new variables:

pub struct Cpu {
    //...

    /// Set by the current instruction if a branch occured and the
    /// next instruction will be in the delay slot.
    branch: bool,
    /// Set if the current instruction executes in the delay slot
    delay_slot: bool,
}

impl Cpu {

    pub fn new(inter: Interconnect) -> Cpu {
        //...

        Cpu {
            //...
            branch:     false,
            delay_slot: false,
        }
    }

    pub fn run_next_instruction(&mut self) {
        //...
        let instruction = Instruction(self.load32(self.pc));

        // If the last instruction was a branch then we're in the
        // delay slot
        self.delay_slot = self.branch;
        self.branch     = false;

        self.decode_and_execute(instruction);

        //...
    }

    //...
}

Now we can simply modify (once again) all the branch and jump instructions to set self.branch = true. In the next cycle run_next_instruction will copy this variable to self.delay_slot.

Now that we keep track of delay slots we can modify our exception code to handle them accurately:

impl Cpu {
    //...

    /// Trigger an exception
    fn exception(&mut self, cause: Exception) {
        //...

        // Update `CAUSE` register with the exception code (bits
        // [6:2])
        self.cause = (cause as u32) << 2;

        // Save current instruction address in `EPC`
        self.epc = self.current_pc;

        if self.delay_slot {
            // When an exception occurs in a delay slot `EPC` points
            // to the branch instruction and bit 31 of `CAUSE` is set.
            self.epc = self.epc.wrapping_sub(4);
            self.cause |= 1 << 31;
        }

        //...
    }
}

With our exception handling infrastructure in place we can take the opportunity to review some exception conditions we've ignored so far and implement them accurately.

ADD and ADDI overflows

The ADD and ADDI opcodes generate an exception on signed overflow but in our current implementation is incomplete. We can use our exception method to handle them in full:

impl Cpu {
    //...

    /// Add and check for signed overflow
    fn op_add(&mut self, instruction: Instruction) {
        let s = instruction.s();
        let t = instruction.t();
        let d = instruction.d();

        let s = self.reg(s) as i32;
        let t = self.reg(t) as i32;

        match s.checked_add(t) {
            Some(v) => self.set_reg(d, v as u32),
            None    => self.exception(Exception::Overflow),
        }
    }

    /// Add Immediate and check for signed overflow
    fn op_addi(&mut self, instruction: Instruction) {
        let i = instruction.imm_se() as i32;
        let t = instruction.t();
        let s = instruction.s();

        let s = self.reg(s) as i32;

        match s.checked_add(i) {
            Some(v) => self.set_reg(t, v as u32),
            None    => self.exception(Exception::Overflow),
        }
    }
}

/// Exception types (as stored in the `CAUSE` register)
enum Exception {
    //...

    /// Arithmetic overflow
    Overflow = 0xc,
}

Store and load alignment exceptions

When a load or store instruction targets a misaligned address (i.e. a word access address is not a multiple of 4 or a halfword access address is not a multiple of 2) the CPU is supposed to generate an exception:

impl Cpu {
    //...

    /// Load Word
    fn op_lw(&mut self, instruction: Instruction) {
        //...

        // Address must be 32bit aligned
        if addr % 4 == 0 {
            let v = self.load32(addr);

            // Put the load in the delay slot
            self.load = (t, v);
        } else {
            self.exception(Exception::LoadAddressError);
        }
    }

    /// Store Halfword
    fn op_sh(&mut self, instruction: Instruction) {
        //...

        // Address must be 16bit aligned
        if addr % 2 == 0 {
            self.store16(addr, v as u16);
        } else {
            self.exception(Exception::StoreAddressError);
        }
    }

    /// Store Word
    fn op_sw(&mut self, instruction: Instruction) {
        //...

        // Address must be 32bit aligned
        if addr % 4 == 0 {
            self.store32(addr, v);
        } else {
            self.exception(Exception::StoreAddressError);
        }
    }
}

/// Exception types (as stored in the `CAUSE` register)
enum Exception {
    //...

    /// Address error on load
    LoadAddressError = 0x4,
    /// Address error on store
    StoreAddressError = 0x5,
}

PC alignment exception

We should also generate an exception if the PC address is not correctly aligned when we attempt to fetch an instruction. This can happen if a JR or JALR instruction jumped to an address that was not 32bit aligned21:

impl Cpu {
    //...

    pub fn run_next_instruction(&mut self) {
        // Save the address of the current instruction to save in
        // `EPC` in case of an exception.
        self.current_pc = self.pc;

        if self.current_pc % 4 != 0 {
            // PC is not correctly aligned!
            self.exception(Exception::LoadAddressError);
            return;
        }

        // Fetch instruction at PC
        let instruction = Instruction(self.load32(self.pc));

        //...
    }
}

RAM 16bit store

If the exceptions are implemented correctly our next unhandled condition should be a SH targeting address 0x800dee24. This address is in the RAM so we just need to add 16bit store support for it:

impl Interconnect {
    //...

    /// Store 16bit halfword `val` into `addr`
    pub fn store16(&mut self, addr: u32, val: u16) {

        let abs_addr = map::mask_region(addr);

        if let Some(offset) = map::RAM.contains(abs_addr) {
            return self.ram.store16(offset, val);
        }

        //...
    }
}

And then in our RAM implementation:

impl Ram {
    //...

    /// Store the 16bit little endian halfword `val` into `offset`
    pub fn store16(&mut self, offset: u32, val: u16) {
        let offset = offset as usize;

        let b0 = val as u8;
        let b1 = (val >> 8) as u8;

        self.data[offset + 0] = b0;
        self.data[offset + 1] = b1;
    }
}

As always, make sure you get the endianess right.

DMA registers

We then stumble upon an unhandled load from address 0x1f8010f0. Looking at the memory map this is the "DMA control register". DMA stands for Direct Memory Access. This is a generic term which can mean different things on different architectures but the concept is always the same: it's used to move data between a peripheral and RAM without directly involving the CPU.

For instance if a game wants to load a texture to the GPU memory it can set up the DMA to do the copy instead of doing it from the CPU with a series of LW/SW. This is generally faster since the DMA is usually more efficient for moving data around and while it's working the CPU can do more interesting things22.

Since we still have some work to do on the CPU let's see if we can ignore the DMA access for now:

impl Interconnect {
    //...

    /// Load 32bit word at `addr`
    pub fn load32(&self, addr: u32) -> u32 {
        //...

        if let Some(_) = map::DMA.contains(abs_addr) {
            println!("DMA read: {:08x}", abs_addr);
            return 0;
        }

        panic!("unhandled load32 at address {:08x}", addr);
    }
}

/// Direct Memory Access registers
pub const DMA: Range = Range(0x1f801080, 0x80);

You'll notice that I ignore all loads from any DMA register, not just the control. Let's hope we'll be able to keep the smoke screen up for a little longer.

Soon after that we encounter a SW targeting the DMA control register with the value 0x000b0000. This value configures the DMA SPU channel priority and enables it. This probably means the BIOS is getting ready to play some sound. Since we don't care about the SPU or the DMA at that point let's ignore those writes as well:

impl Interconnect {
    //...

    /// Store 32bit word `val` into `addr`
    pub fn store32(&mut self, addr: u32, val: u32) {
        //...

        if let Some(_) = map::DMA.contains(abs_addr) {
            println!("DMA write: {:08x} {:08x}", abs_addr, val);
            return;
        }

        panic!("unhandled store32 into address {:08x}: {:08x}", addr, val);
    }
}

Hopefuly we should be able to ignore the DMA for a while and keep focusing on the CPU.

LHU instruction

The next unhandled instruction is 0x961901ae which is "load halfword unsigned" (LHU):

lhu $25, 430($16)

It's the 16bit counterpart to LBU and it's our first 16bit load istruction:

impl Cpu {
    //...

    /// Load 16bit value from the memory
    fn load16(&self, addr: u32) -> u16 {
        self.inter.load16(addr)
    }

    /// Load Halfword Unsigned
    fn op_lhu(&mut self, instruction: Instruction) {

        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);

        // Address must be 16bit aligned
        if addr % 2 == 0 {
            let v = self.load16(addr);

            // Put the load in the delay slot
            self.load = (t, v as u32);
        } else {
            self.exception(Exception::LoadAddressError);
        }
    }
}

We need to implement load16 in the interconnect. The current instruction attempts to load from 0x1f801dae which is the SPU status register. Let's lie once again and return 0 for SPU reads:

impl Interconnect {
    //...

    /// Load 16bit halfword at `addr`
    pub fn load16(&self, addr: u32) -> u16 {
        let abs_addr = map::mask_region(addr);

        if let Some(_) = map::SPU.contains(abs_addr) {
            println!("Unhandled read from SPU register {:08x}",
                     abs_addr);
            return 0;
        }

        panic!("unhandled load16 at address {:08x}", addr);
    }
}

If we continue the emulation we stumble on an other unhandled load16, this time at address 0x800dee24. This one is easy, it's RAM:

impl Interconnect {
    //...

    /// Load 16bit halfword at `addr`
    pub fn load16(&self, addr: u32) -> u16 {
        //...

        if let Some(offset) = map::RAM.contains(abs_addr) {
            return self.ram.load16(offset);
        }

        panic!("unhandled load16 at address {:08x}", addr);
    }

}

And in our RAM implementation:

impl Ram {
    //...

    /// Fetch the 16bit little endian halfword at `offset`
    pub fn load16(&self, offset: u32) -> u16 {
        let offset = offset as usize;

        let b0 = self.data[offset + 0] as u16;
        let b1 = self.data[offset + 1] as u16;

        b0 | (b1 << 8)
    }
}

SLLV instruction

After that we encounter 0x0078c804 which is "shift left logical variable" (SLLV):

sllv $25, $24, $3

It's like SLL except the shift amount is stored in a register instead of an immediate value.

The implementation is quite simple but there's something to consider: so far the shift amount was always a 5bit immediate value but this time it's a 32bit register. What happens when the register value is greater than 31?

It's also important to figure out because shifting out of range is undefined in Rust (and in C) so we have to be careful not to introduce weird undefined behavior in our emulator.

Shifting by more than 31 places would mean shifting the 32bit value completely out of range. Intuitively you might say that it sets it to 0 (all significant bits get shifted outside the register) but it turns out it's not accurate.

In reality on the R3000 CPU the shift amount is always implicitly masked with 0x1f to only keep the low 5 bits. It means that a shift amount of 32 behaves like 0 (i.e. it's a NOP) while 130 behaves like 2:

impl Cpu {
    //...

    /// Shift Left Logical Variable
    fn op_sllv(&mut self, instruction: Instruction) {
        let d = instruction.d();
        let s = instruction.s();
        let t = instruction.t();

        // Shift amount is truncated to 5 bits
        let v = self.reg(t) << (self.reg(s) & 0x1f);

        self.set_reg(d, v);
    }
}

LH instruction

We implemented LHU not long ago and now we meet 0x87a30018 which is "load halfword" (LH):

lh $3, 24($29)

It's implemented like LHU but it sign-extends the 16bit value to fit the 32bit target register:

impl Cpu {
    //...

    /// Load Halfword (signed)
    fn op_lh(&mut self, instruction: Instruction) {

        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);

        // Cast as i16 to force sign extension
        let v = self.load16(addr) as i16;

        // Put the load in the delay slot
        self.load = (t, v as u32);
    }
}

NOR instruction

After that we stumble upon 0x0040c827 which is "bitwise not or" (NOR):

nor $25, $2, $zero

It simply computes a bitwise OR between two registers and then complements the result before storing it in the destination register23:

impl Cpu {
    //...

    /// Bitwise Not Or
    fn op_nor(&mut self, instruction: Instruction) {
        let d = instruction.d();
        let s = instruction.s();
        let t = instruction.t();

        let v = !(self.reg(s) | self.reg(t));

        self.set_reg(d, v);
    }
}

SRAV instruction

The next unhandled instruction is 0x00e84007 which encodes "shift right arithmetic variable" (SRAV):

srav $8, $8, $7

We've already implemented SRA and SLLV so this one shouldn't give us any trouble:

impl Cpu {
    //...

    /// Shift Right Arithmetic Variable
    fn op_srav(&mut self, instruction: Instruction) {
        let d = instruction.d();
        let s = instruction.s();
        let t = instruction.t();

        // Shift amount is truncated to 5 bits
        let v = (self.reg(t) as i32) >> (self.reg(s) & 0x1f);

        self.set_reg(d, v as u32);
    }
}

SRLV instruction

We finally encounter the last shift instruction: 0x01a52806 is "shift right logical variable" (SRLV):

srlv $5, $5, $13

It's implemented like SRAV without sign extension (or like SRL with a register holding the shift amount, if you prefer):

impl Cpu {
    //...

    /// Shift Right Logical Variable
    fn op_srlv(&mut self, instruction: Instruction) {
        let d = instruction.d();
        let s = instruction.s();
        let t = instruction.t();

        // Shift amount is truncated to 5 bits
        let v = self.reg(t) >> (self.reg(s) & 0x1f);

        self.set_reg(d, v);
    }
}

MULTU instruction

The next unhandled instruction is 0x01240019 which encodes "multiply unsigned" (MULTU):

multu $9, $4

It's our first multiplication opcode. The CPU does the multiplication using 64bit arithmetics and store the result across the HI and LO registers:

impl Cpu {
    //...

    /// Multiply Unsigned
    fn op_multu(&mut self, instruction: Instruction) {
        let s = instruction.s();
        let t = instruction.t();

        let a = self.reg(s) as u64;
        let b = self.reg(t) as u64;

        let v = a * b;

        self.hi = (v >> 32) as u32;
        self.lo = v as u32;
    }
}

The timings of the multiplication instructions are similar to the divisions: they run in the background and only stall the CPU if it attempts to read the LO or HI registers before it's done. Since we don't implement accurate CPU timings I choose to ignore that for now.

GPU registers

Our next stop will be an unhandled LW at address 0x1f801814. This register is GPUSTAT when read and GP1 when written. In other words GPUSTAT is read only while GP1 is write only and they share the same address. Why not.

GPUSTAT contains a whole bunch of information about the GPU status. Things like the display's resolution and color depth, interlacing, DMA channel status and more.

It seems we're entering to the display initialization code, we might soon be pushing our first pixels to the screen! Boot logo, here we come.

Well, let's not get ahead of ourselves, for now we have zero GPU emulation code so we're going to use the usual deception and have the BIOS read zeroes when it attempts to access the GPU register space. That's easy, there are only two registers in the GPU24:

impl Interconnect {
    //...

    /// Load 32bit word at `addr`
    pub fn load32(&self, addr: u32) -> u32 {
        //...

        if let Some(offset) = map::GPU.contains(abs_addr) {
            println!("GPU read {}", offset);
            return 0;
        }

        panic!("unhandled load32 at address {:08x}", addr);
    }
}

GP0: Draw Mode Setting command

Very soon after that we get an unhandled write32 at address 0x1f801810. This is the other GPU register address. This one is GP0 for writing and it's used to queue commands.

We'll study the GPU more closely soon but for now it suffices to say that it's programmed differently from the other peripherals we've seen so far: instead of having dedicated registers for the various function, the CPU (or DMA) queues commands in one of the two ports (GP0 and GP1) which behave like FIFOs. The GPU then executes the commands one after an other.

Commands include drawing triangles, lines and sprites with various attributes but also things like interrupt management and display configuration.

In order to interpret a GPU command we must first see to which port it was posted (GP0 in this case). Then we must look at the value: 0xe1001000 here. The high byte (0xe1) is the "opcode", the remaining 24bits are parameters whose meaning depends on the command.

This particular opcode is "Draw Mode Setting". It mostly sets a bunch of texture-related parameters. In this particular instance only bit 12 is set which activates "Textured Rectangle X-Flip". Not exactly obvious why the BIOS is doing this right now but I guess we'll figure out that soon.

For now we're still working on our CPU, so let's just ignore writes to the GPU ports and hope we can get away with it:

impl Interconnect {
    //...

    /// Store 32bit word `val` into `addr`
    pub fn store32(&mut self, addr: u32, val: u32) {
        //...

        if let Some(offset) = map::GPU.contains(abs_addr) {
            println!("GPU write {}: {:08x}", offset, val);
            return;
        }

        panic!("unhandled store32 into address {:08x}: {:08x}", addr, val);
    }
}

Interrupt Control 16bit access

Unfortunately we don't go very far, the BIOS then wants to make a 16bit read at the Interrupt Mask address. So far we've only implemented 32bit access so let's add halfword support:

impl Interconnect {
    //...

    /// Load 16bit halfword at `addr`
    pub fn load16(&self, addr: u32) -> u16 {
        //...

        if let Some(offset) = map::IRQ_CONTROL.contains(abs_addr) {
            println!("IRQ control read {:x}", offset);
            return 0;
        }
    }
}

Unsurprisingly it's followed by a 16 bit write to the same address with the value 1. This means that the BIOS wants to use the first interrupt which is the vertical blanking interrupt generated by the GPU's video output. As usual let's ignore that:

impl Interconnect {
    //...

    /// Store 16bit halfword `val` into `addr`
    pub fn store16(&mut self, addr: u32, val: u16) {
        //...

        if let Some(offset) = map::IRQ_CONTROL.contains(abs_addr) {
            println!("IRQ control write {:x}, {:04x}", offset, val);
            return;
        }
    }
}

Timer registers 32bit access

After that we get an unhandled 32bit access to the timers range.

This time the BIOS wants to store 0xffffffff at 0x1f801118 which is the counter target value for timer 1. When the counter reaches that value it goes back to 0 and optionally generates an interrupt. The counter is only 16bit wide though so this write would actually set the target value to 0xffff and the upper 16bits are ignored.

Let's add our usual placeholder code:

impl Interconnect {
    //...

    /// Store 32bit word `val` into `addr`
    pub fn store32(&mut self, addr: u32, val: u32) {
        //...

        if let Some(offset) = map::TIMERS.contains(abs_addr) {
            println!("Unhandled write to timer register {:x}: {:08x}",
                     offset, val);
            return;
        }
    }
}

After that the BIOS writes 0x148 to 0x1f801114 which sets the timer 1 mode. Bit 0x8 clears the counter (resets it to 0), bit 0x40 sets the timer interrupt to repeat mode which means that it will fire periodically when the counter reaches the target. Finally bit 0x100 sets the clock source as "horizontal blanking". It means that the timer increments when the display reaches the horizontal blanking period.

This doesn't set bit 0x10 however which would actually enable the interrupt. And it hasn't attempted to unmask the interrupt in the Interrupt Mask register either anyway. Not sure where the BIOS is going with this.

After that the BIOS tries to change the value of the Interrupt Mask and enables interrupt 0x8 which is the DMA's.

GPUSTAT "DMA ready" field

At this point the BIOS enters an infinite loop: it reads the GPUSTAT register again and again. Obviously it's waiting for something to happen but since we only ever return 0 it deadlocks.

If we disassemble that loop the code looks like this (it's in the BIOS at address 0xbfc04190):

lw $8, 0($6)     /* Here $6 is equal to 0x1f801814 (GPUSTAT) */
nop              /* load delay slot */
and $9, $8, $4   /* Here $4 contains 0x10000000 */
beq $9, $0, -44  /* Loop back if $9 is zero */

There are more things in the loop but that's the important part. We can see that the BIOS loads GPUSTAT, masks bit 28 and loops if it's 0.

If we look at the specs we can see that bit 28 of GPUSTAT tells if the GPU is ready to receive a DMA block. So it seems that the BIOS is polling this bit in GPUSTAT because it's about to initiate a DMA transfer between the RAM and the GPU.

Let's modify our GPUSTAT handling code to return 0x10000000 when read:

impl Interconnect {
    //...

    /// Load 32bit word at `addr`
    pub fn load32(&self, addr: u32) -> u32 {
        //...

        if let Some(offset) = map::GPU.contains(abs_addr) {
        println!("GPU read {}", offset);
            return match offset {
                // GPUSTAT: set bit 28 to signal that the GPU is ready
                // to receive DMA blocks
                4 => 0x10000000,
                _ => 0,
            }
        }

        //...
    }
}

This lets the BIOS continue the execution a little further.

XOR instruction

We then encounter an unhandled instruction: 0x0303c826 which encodes an "exclusive or" (XOR):

xor $25, $24, $3

We can implement it by copying the OR method and replacing the | operator with ^:

mod Cpu {
    //...

    /// Bitwise Exclusive Or
    fn op_xor(&mut self, instruction: Instruction) {
        let d = instruction.d();
        let s = instruction.s();
        let t = instruction.t();

        let v = self.reg(s) ^ self.reg(t);

        self.set_reg(d, v);
    }
}

With this instruction implemented the BIOS then goes on to write a bunch of DMA registers and then gets stuck in an other infinite loop, polling GPUSTAT once again.

We could look at what the BIOS is doing once again to try and figure out the right value to return to let it continue but that would be a bit pointless at that point. We've almost implemented all the CPU instructions anyway and we've reach the part of the BIOS where the bootup logo is drawn. We need to implement the DMA to send the commands to the GPU and then emulate the GPU itself to accept those commands and draw on the screen.

Before we move on though let's implement the handful of CPU opcodes we haven't yet encountered. At this point we've implemented 48 opcodes and 19 are remaining. Fortunately most of those are variations of instructions we've already implemented so let's get this over with.

BREAK instructions

BREAK triggers an exception like SYSCALL but it sets code 9 in the CAUSE register. This instruction is generally meant to create software breakpoints in code for debugging purposes but I imagine some games might abuse it for other purposes.

This instruction is encoded by setting bits [31:26] of the instruction to zero and bits [5:0] to 0xd.

impl Cpu {
    //...

    /// Break
    fn op_break(&mut self, _: Instruction) {
        self.exception(Exception::Break);
    }
}

/// Exception types (as stored in the `CAUSE` register)
enum Exception {
    //...

    /// Breakpoint (caused by the BREAK opcode)
    Break = 0x9,
}

MULT instruction

"Multiply" (MULT) is simply the signed counterpart to MULTU. It multiplies its operands using 64bit signed arithmetics and stores stores the result in HI and LO.

This instruction is encoded by setting bits [31:26] of the instruction to zero and bits [5:0] to 0x18.

impl Cpu {
    //...

    /// Multiply (signed)
    fn op_mult(&mut self, instruction: Instruction) {
        let s = instruction.s();
        let t = instruction.t();

        let a = (self.reg(s) as i32) as i64;
        let b = (self.reg(t) as i32) as i64;

        let v = (a * b) as u64;

        self.hi = (v >> 32) as u32;
        self.lo = v as u32;
    }
}

All those casts are a bit ugly but they're necessary to get the proper sign extension.

SUB instruction

"Substract" (SUB) is like SUBU but with signed arithmetics and it triggers an exception on signed overflow.

This instruction is encoded by setting bits [31:26] of the instruction to zero and bits [5:0] to 0x22.

impl Cpu {
    //...

    /// Substract and check for signed overflow
    fn op_sub(&mut self, instruction: Instruction) {
        let s = instruction.s();
        let t = instruction.t();
        let d = instruction.d();

        let s = self.reg(s) as i32;
        let t = self.reg(t) as i32;

        match s.checked_sub(t) {
            Some(v) => self.set_reg(d, v as u32),
            None    => self.exception(Exception::Overflow),
        }
    }
}

XORI instruction

"Exclusive or immediate" (XORI) is the version of the XOR instruction taking an immediate operand. We can implement it by taking the code for ORI and changing the operator.

This instruction is encoded by setting bits [31:26] of the instruction to 0xe.

impl Cpu {
    //...

    /// Bitwise eXclusive Or Immediate
    fn op_xori(&mut self, instruction: Instruction) {
        let i = instruction.imm();
        let t = instruction.t();
        let s = instruction.s();

        let v = self.reg(s) ^ i;

        self.set_reg(t, v);
    }
}

Cop1, cop2 and cop3 opcodes

We've implemented cop0 instructions (MTC0, RFE etc...). The three other coprocessors can also have dedicated opcodes. On the Playstation however cop1 and cop3 are not used so any instruction targeting them will trigger an exception with code 0xb to signal a coprocessor error..

Cop1 and cop3 opcodes are encoded by setting bits [31:26] of the instruction to 0x11 and 0x13 respectively.

impl Cpu {
    //...

    /// Coprocessor 1 opcode (does not exist on the Playstation)
    fn op_cop1(&mut self, _: Instruction) {
        self.exception(Exception::CoprocessorError);
    }

    /// Coprocessor 3 opcode (does not exist on the Playstation)
    fn op_cop3(&mut self, _: Instruction) {
        self.exception(Exception::CoprocessorError);
    }
}

/// Exception types (as stored in the `CAUSE` register)
enum Exception {
    //...

    /// Unsupported coprocessor operation
    CoprocessorError = 0xb,
}

Cop2 however is implemented on the Playstation: it's the Geometry Transform Engine (GTE). We don't need to implement the GTE for now so let's just add a dummy implementation that will crash the emulator if a GTE instruction is encountered.

Cop opcodes are encoded by setting bits [31:26] of the instruction to 0x12.

impl Cpu {
    //...

    /// Coprocessor 2 opcode (GTE)
    fn op_cop2(&mut self, instruction: Instruction) {
        panic!("unhandled GTE instruction: {}", instruction);
    }
}

Non-aligned reads

So far we've seen that all CPU memory transactions had to be properly aligned or they would trigger an exception. The MIPS instruction set does however have limited support for unaligned access. For unaligned reads it provides "load word left" (LWL) and "load word right" (LWR).

Both those instruction work by fetching the aligned word containing the addressed byte and then shifting the value to only update the correct portion of the target register.

Therefore in order to load a single unaligned word you need to run a both a LWL and a LWR in sequence (the order doesn't matter) to fetch the 32bits.

The behaviour of both these instructions changes depending on whether the CPU is running in big or little-endian mode. Since the PSX runs exclusively in little endian we can ignore the other case.

For a little endian architecture and assuming $2 contains the potentially unaligned load address the sequence would look like this:

/* Load right part of potentially unaligned word at $2 */
lwr $1, 0($2)
/* Load left part of potentially unaligned word at $2 */
lwl $1, 3($2)

After this sequence $1 contains the 4byte little endian value at the address stored in $2 regardless of its alignment.

You can see that the LWL instruction is given an offset of 3. If the address was correctly aligned we remain within the same aligned 32bit word, otherwise we've moved to the next one.

Okay, that might sound a bit complicated, hopefully everything will be clearer when we see the code of the implementation.

Before that however it's important to note a specificity of these unaligned word instructions: you'll notice that in my asm snippet above I run the two instructions back-to-back without delay. That's because those instructions can merge their data with that of a pending load without having to wait for the load to finish.

For other load instructions it wouldn't make a lot of sense (why would you want to load twice to the same target register without doing anything with the first value?) but since LWL and LWR are meant to be used together to load a single value it makes sense to spare a cycle there25.

LWL instruction

The "load word left" (LWL) opcode is encoded by setting bits [31:26] of the instruction to 0x22.

impl Cpu {
    //...

    /// Load Word Left (little-endian only implementation)
    fn op_lwl(&mut self, instruction: Instruction) {

        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);

        // This instruction bypasses the load delay restriction: this
        // instruction will merge the new contents with the value
        // currently being loaded if need be.
        let cur_v = self.out_regs[t.0 as usize];

        // Next we load the *aligned* word containing the first
        // addressed byte
        let aligned_addr = addr & !3;
        let aligned_word = self.load32(aligned_addr);

        // Depending on the address alignment we fetch the 1, 2, 3 or
        // 4 *most* significant bytes and put them in the target
        // register.
        let v = match addr & 3 {
            0 => (cur_v & 0x00ffffff) | (aligned_word << 24),
            1 => (cur_v & 0x0000ffff) | (aligned_word << 16),
            2 => (cur_v & 0x000000ff) | (aligned_word << 8),
            3 => (cur_v & 0x00000000) | (aligned_word << 0),
            _ => unreachable!(),
        };

        // Put the load in the delay slot
        self.load = (t, v);
    }
}

Hopefully the comments are clear enough to follow what the code is doing. You can see that LWL updates one, two, three or all four bytes in the target register depending on the address alignment.

Note the direct reference to self.out_regs instead of our usual helper to make sure we ignore the load delay when the two instructions are used in sequence.

LWR instruction

The "load word right" (LWR) opcode is encoded by setting bits [31:26] of the instruction to 0x26. The implementation is very similar to LWL with a few key changes:

impl Cpu {
    //...

    /// Load Word Right (little-endian only implementation)
    fn op_lwr(&mut self, instruction: Instruction) {

        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);

        // This instruction bypasses the load delay restriction: this
        // instruction will merge the new contents with the value
        // currently being loaded if need be.
        let cur_v = self.out_regs[t.0 as usize];

        // Next we load the *aligned* word containing the first
        // addressed byte
        let aligned_addr = addr & !3;
        let aligned_word = self.load32(aligned_addr);

        // Depending on the address alignment we fetch the 1, 2, 3 or
        // 4 *least* significant bytes and put them in the target
        // register.
        let v = match addr & 3 {
            0 => (cur_v & 0x00000000) | (aligned_word >> 0),
            1 => (cur_v & 0xff000000) | (aligned_word >> 8),
            2 => (cur_v & 0xffff0000) | (aligned_word >> 16),
            3 => (cur_v & 0xffffff00) | (aligned_word >> 24),
            _ => unreachable!(),
        };

        // Put the load in the delay slot
        self.load = (t, v);
    }
}

You can see that like LWL we update from one to four bytes depending on the alignment, however this time it's the least significant bytes.

Non-aligned writes

Naturally the MIPS instruction set doesn't only support loading non-aligned words, it can also store them using "store word left" (SWL) and "store word right" (SWR).

The concept is the same: to store a 32bit integer at an unaligned access one would call SWR and SWL in sequence to update the entire word.

SWL instruction

The "store word left" (SWL) opcode is encoded by setting bits [31:26] of the instruction to 0x2a. Since we only update part of the aligned target word we have to fetch its value before we can modify it and store it back again:

impl Cpu {
    //...

    /// Store Word Left (little-endian only implementation)
    fn op_swl(&mut self, instruction: Instruction) {

        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);
        let v    = self.reg(t);

        let aligned_addr = addr & !3;
        // Load the current value for the aligned word at the target
        // address
        let cur_mem = self.load32(aligned_addr);

        let mem = match addr & 3 {
            0 => (cur_mem & 0xffffff00) | (v >> 24),
            1 => (cur_mem & 0xffff0000) | (v >> 16),
            2 => (cur_mem & 0xff000000) | (v >> 8),
            3 => (cur_mem & 0x00000000) | (v >> 0),
            _ => unreachable!(),
        };

        self.store32(aligned_addr, mem);
    }
}

SWR instruction

The "store word right" (SWR) opcode is encoded by setting bits [31:26] of the instruction to 0x2e. It's very similar to SWL except for a a few key differences:

impl Cpu {
    //...

    /// Store Word Right (little-endian only implementation)
    fn op_swr(&mut self, instruction: Instruction) {

        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);
        let v    = self.reg(t);

        let aligned_addr = addr & !3;
        // Load the current value for the aligned word at the target
        // address
        let cur_mem = self.load32(aligned_addr);

        let mem = match addr & 3 {
            0 => (cur_mem & 0x00000000) | (v << 0),
            1 => (cur_mem & 0x000000ff) | (v << 8),
            2 => (cur_mem & 0x0000ffff) | (v << 16),
            3 => (cur_mem & 0x00ffffff) | (v << 24),
            _ => unreachable!(),
        };

        self.store32(aligned_addr, mem);
    }
}

Coprocessor loads and stores

We've seen that MTC0 and MFC0 can be used to move data between the general purpose registers and the coprocessor 0. That means that if you want to load or store a cop0 register value from or to the memory we have to pass through the CPU general purpose registers.

The coprocessor 2 (the GTE) supports an additional, more optimized way to do this: "load word to coprocessor 2" (LWC2) and "store word from coprocessor 2"(SWC2). Those instructions respectively load and store a cop2 register directly from and to memory.

Since the other coprocessors don't support these opcodes they generate a "coprocessor error" exception when they're encountered.

LWCn instructions

"Load word coprocessor n" (LWCn) opcodes are encoded by setting bits [31:26] of the instruction 0x30 + n.

impl Cpu {
    //...

    /// Load Word in Coprocessor 0
    fn op_lwc0(&mut self, _: Instruction) {
        // Not supported by this coprocessor
        self.exception(Exception::CoprocessorError);
    }

    /// Load Word in Coprocessor 1
    fn op_lwc1(&mut self, _: Instruction) {
        // Not supported by this coprocessor
        self.exception(Exception::CoprocessorError);
    }

    /// Load Word in Coprocessor 2
    fn op_lwc2(&mut self, instruction: Instruction) {
        panic!("unhandled GTE LWC: {}", instruction);
    }

    /// Load Word in Coprocessor 3
    fn op_lwc3(&mut self, _: Instruction) {
        // Not supported by this coprocessor
        self.exception(Exception::CoprocessorError);
    }
}

SWCn instructions

"Store word coprocessor n" (SWCn) opcodes are encoded by setting bits [31:26] of the instruction 0x38 + n.

impl Cpu {
    //...

    /// Store Word in Coprocessor 0
    fn op_swc0(&mut self, _: Instruction) {
        // Not supported by this coprocessor
        self.exception(Exception::CoprocessorError);
    }

    /// Store Word in Coprocessor 1
    fn op_swc1(&mut self, _: Instruction) {
        // Not supported by this coprocessor
        self.exception(Exception::CoprocessorError);
    }

    /// Store Word in Coprocessor 2
    fn op_swc2(&mut self, instruction: Instruction) {
        panic!("unhandled GTE SWC: {}", instruction);
    }

    /// Store Word in Coprocessor 3
    fn op_swc3(&mut self, _: Instruction) {
        // Not supported by this coprocessor
        self.exception(Exception::CoprocessorError);
    }
}

Illegal instructions

We now have implemented (at least partially) all the CPU instructions! That doesn't mean that our CPU is complete: we still have to implement the GTE coprocessor and the cache for instance but that will wait for later.

We can also take this opportunity to implement illegal instructions. For instance instruction 0x50000000 doesn't encode any valid instruction on the Playstation CPU and is therefore illegal.

Illegal instructions simply trigger an exception on the CPU with the code 0xa in the CAUSE register.

Knowing that we can complete our decode_and_execute function, here's what it should look like with all instructions implemented:

impl Cpu {
    //...

    /// Decode `instruction`'s opcode and run the function
    fn decode_and_execute(&mut self, instruction: Instruction) {
        match instruction.function() {
            0b000000 => match instruction.subfunction() {
                0b000000 => self.op_sll(instruction),
                0b000010 => self.op_srl(instruction),
                0b000011 => self.op_sra(instruction),
                0b000100 => self.op_sllv(instruction),
                0b000110 => self.op_srlv(instruction),
                0b000111 => self.op_srav(instruction),
                0b001000 => self.op_jr(instruction),
                0b001001 => self.op_jalr(instruction),
                0b001100 => self.op_syscall(instruction),
                0b001101 => self.op_break(instruction),
                0b010000 => self.op_mfhi(instruction),
                0b010001 => self.op_mthi(instruction),
                0b010010 => self.op_mflo(instruction),
                0b010011 => self.op_mtlo(instruction),
                0b011000 => self.op_mult(instruction),
                0b011001 => self.op_multu(instruction),
                0b011010 => self.op_div(instruction),
                0b011011 => self.op_divu(instruction),
                0b100000 => self.op_add(instruction),
                0b100001 => self.op_addu(instruction),
                0b100010 => self.op_sub(instruction),
                0b100011 => self.op_subu(instruction),
                0b100100 => self.op_and(instruction),
                0b100101 => self.op_or(instruction),
                0b100110 => self.op_xor(instruction),
                0b100111 => self.op_nor(instruction),
                0b101010 => self.op_slt(instruction),
                0b101011 => self.op_sltu(instruction),
                _        => self.op_illegal(instruction),
            },
            0b000001 => self.op_bxx(instruction),
            0b000010 => self.op_j(instruction),
            0b000011 => self.op_jal(instruction),
            0b000100 => self.op_beq(instruction),
            0b000101 => self.op_bne(instruction),
            0b000110 => self.op_blez(instruction),
            0b000111 => self.op_bgtz(instruction),
            0b001000 => self.op_addi(instruction),
            0b001001 => self.op_addiu(instruction),
            0b001010 => self.op_slti(instruction),
            0b001011 => self.op_sltiu(instruction),
            0b001100 => self.op_andi(instruction),
            0b001101 => self.op_ori(instruction),
            0b001110 => self.op_xori(instruction),
            0b001111 => self.op_lui(instruction),
            0b010000 => self.op_cop0(instruction),
            0b010001 => self.op_cop1(instruction),
            0b010010 => self.op_cop2(instruction),
            0b010011 => self.op_cop3(instruction),
            0b100000 => self.op_lb(instruction),
            0b100001 => self.op_lh(instruction),
            0b100010 => self.op_lwl(instruction),
            0b100011 => self.op_lw(instruction),
            0b100100 => self.op_lbu(instruction),
            0b100101 => self.op_lhu(instruction),
            0b100110 => self.op_lwr(instruction),
            0b101000 => self.op_sb(instruction),
            0b101001 => self.op_sh(instruction),
            0b101010 => self.op_swl(instruction),
            0b101011 => self.op_sw(instruction),
            0b101110 => self.op_swr(instruction),
            0b110000 => self.op_lwc0(instruction),
            0b110001 => self.op_lwc1(instruction),
            0b110010 => self.op_lwc2(instruction),
            0b110011 => self.op_lwc3(instruction),
            0b111000 => self.op_swc0(instruction),
            0b111001 => self.op_swc1(instruction),
            0b111010 => self.op_swc2(instruction),
            0b111011 => self.op_swc3(instruction),
            _        => self.op_illegal(instruction),
        }
    }

    /// Illegal instruction
    fn op_illegal(&mut self, instruction: Instruction) {
        println!("Illegal instruction {}!", instruction);
        self.exception(Exception::IllegalInstruction);
    }
}

/// Exception types (as stored in the `CAUSE` register)
enum Exception {
    //...

   /// CPU encountered an unknown instruction
    IllegalInstruction = 0xa,
}

That's quite a milestone but it's only the beginning. While implementing all those instructions and stepping through the BIOS we've seen that it tries to use many peripherals: the SPU, the timers, the DMA and the GPU in particular.

At this point my first objective is to display an image to the screen so I want to start implementing the GPU as soon as possible. But we won't be able to do anything useful with the GPU without the DMA, so let's start with that.

The DMA: Ordering tables and the GPU

The DMA is used to move data back and forth between the RAM and a peripheral (GPU, CDROM, SPU, etc...). The CPU could achieve the same results by a series of loads/stores but the DMA is generally much faster.

The Playstation DMA controller lives alongside the CPU and shares the memory BUS with it. It means that while the DMA is busy transferring data the CPU is stopped: only one device can access the BUS at a given time. The DMA can only copy data between the RAM and a device, not directly between two devices. For instance you can't copy a texture from the CDROM directly into the GPU with the DMA, you first have to make a transfer from the CDROM into the main RAM and then a 2nd one between the RAM and the GPU.

There are 7 DMA channels on the Playstation:

  • Channel 0 is connected to the Media Decoder input

  • Channel 1 is connected to the Media Decoder output

  • Channel 2 is connected to the GPU

  • Channel 3 is connected to the CDROM drive

  • Channel 4 is connected to the SPU

  • Channel 5 is connected to the extension port

  • Channel 6 is only connected to the RAM and is used to clear an "ordering table"

Implementing complete and accurate DMA support can be quite tricky. The main problem is that in certain modes the DMA sporadically gives back the control to the CPU. For instance while the GPU is busy processing a command and won't accept any new input the DMA has to wait. Instead of wasting time it gives back control to the CPU to give it the opportunity to do something else.

In order to emulate this behaviour correctly we need to emulate the GPU command FIFO, DMA timings and CPU timings correctly. Then we need to setup the state machine to switch between the CPU and DMA when needed. That would require quite some work to get right and we only have the BIOS boot logo to test it at this point.

To avoid having to implement all that we're going to make a simplifying assumption for now: when the DMA runs it does all the transfer at once without giving back control to the CPU. This won't be exactly accurate but it should suffice to run the BIOS and hopefully some games.

The reason I feel confident doing this simplification is that PCSX-R seems to do it that way and it can run quite many games, although some comments hint that it breaks with certain titles and it uses some hacks to improve compatibility. Mednafen on the other hand implements a much accurate DMA and actually emulates the DMA giving back the control to the CPU in certain situations, we'll probably want to do something similar later on.

For now let's take a few steps back and revisit all the DMA register reads and writes done by the BIOS so that we can emulate them correctly.

DMA Control register

If we look at the DMA register access in our emulator we can see that the first one is a read at 0x1f8010f0 which is offset 0x70 in the DMA register range. This register is the DMA Control register which sets the priority of each channel and whether or not they're enabled. We don't really care about the port priorities since we'll be running each channel transaction entirely at once (so we'll never have two channels active at once for now) and I'm not entirely sure what disabling a channel does (does it prevent accessing the channel's register? What happens if a game attemps to start a disabled channel?). For now we'll just implement a dummy register read/write access.

We're going to wrap the DMA code in a dedicated struct to keep our code tidy. The Nocash spec says that the reset value for the control register is 0x07654321 which means that all channels are disabled and the priority increases with the channel number:

/// Direct Memory Access
pub struct Dma {
    /// DMA control register
    control: u32,
}

impl Dma {
    pub fn new() -> Dma {
        Dma {
            // Reset value taken from the Nocash PSX spec
            control: 0x07654321,
        }
    }

    /// Retrieve the value of the control register
    pub fn control(&self) -> u32 {
        self.control
    }
}

We can then add an instance of this struct Dma in our interconnect and glue our new control method when the register is accessed:

/// Global interconnect
pub struct Interconnect {
    //...

    /// DMA registers
    dma: Dma,
}

impl Interconnect {
    pub fn new(bios: Bios) -> Interconnect {
        Interconnect {
            //...

            dma:  Dma::new(),
        }
    }

    /// Load 32bit word at `addr`
    pub fn load32(&self, addr: u32) -> u32 {
        //...

        if let Some(offset) = map::DMA.contains(abs_addr) {
            return self.dma_reg(offset);
        }
    }

    /// DMA register read
    fn dma_reg(&self, offset: u32) -> u32 {
        match offset {
            0x70 => self.dma.control(),
            _    => panic!("unhandled DMA access")
        }
    }
}

The BIOS then writes back 0x076f4321 to the the same register which means that it enables channel 4 (the SPU) and sets it priority to 7. Let's implement write support for the control register:

impl Interconnect {
    //..

    /// Store 32bit word `val` into `addr`
    pub fn store32(&mut self, addr: u32, val: u32) {
        //...

        if let Some(offset) = map::DMA.contains(abs_addr) {
            return self.set_dma_reg(offset, val);
        }
    }

    /// DMA register write
    fn set_dma_reg(&mut self, offset: u32, val: u32) {
        match offset {
            0x70 => self.dma.set_control(val),
            _    => panic!("unhandled DMA write access")
        }
    }
}

impl Dma {
    //...

    /// Set the value of the control register
    pub fn set_control(&mut self, val: u32) {
        self.control = val
    }
}

Not very exciting so far.

DMA Interrupt register

After that the BIOS writes 0 to the DMA register at offset 0x74. This one is the DMA Interrupt register and as its name implies it is used to configure and acknowledge the DMA interrupts.

Bits [22:16] enable the interrupt individually for each channel. Bit 23 is the master enable: if it's 0 then no interrupt is generated by any channel. Bit 15 on the other hand forces the generation of an interrupt continuously when it's set.

When a channel generates an interrupt it needs to be acknowledged to reset it to an inactive status. This is done by writing 1 to bits [24:30] (one bit per channel). Finally bits [5:0] are read/write but I don't know what they do, we'll juts preserve them and hope they're not important.

While we're at it we'll also implement reading this register. When read bits [24:30] contain the IRQ status for each channel and bit 31 says if an interrupt is currently active. The other fields retain the last value written to them. In code it looks like this26:

/// Direct Memory Access
pub struct Dma {
    //...

    /// master IRQ enable
    irq_en: bool,
    /// IRQ enable for individual channels
    channel_irq_en: u8,
    /// IRQ flags for individual channels
    channel_irq_flags: u8,
    /// When set the interrupt is active unconditionally (even if
    /// `irq_en` is false)
    force_irq: bool,
    /// Bits [0:5] of the interrupt registers are RW but I don't know
    /// what they're supposed to do so I just store them and send them
    /// back untouched on reads
    irq_dummy: u8,
}


impl Dma {
    //...

    /// Return the status of the DMA interrupt
    fn irq(&self) -> bool {
        let channel_irq = self.channel_irq_flags & self.channel_irq_en;

        self.force_irq || (self.irq_en && channel_irq != 0)
    }

    /// Retrieve the value of the interrupt register
    pub fn interrupt(&self) -> u32 {
        let mut r = 0;

        r |= self.irq_dummy as u32;
        r |= (self.force_irq as u32) << 15;
        r |= (self.channel_irq_en as u32) << 16;
        r |= (self.irq_en as u32) << 23;
        r |= (self.channel_irq_flags as u32) << 24;
        r |= (self.irq() as u32) << 31;

        r
    }

    /// Set the value of the interrupt register
    pub fn set_interrupt(&mut self, val: u32) {
        // Unknown what bits [5:0] do
        self.irq_dummy = (val & 0x3f) as u8;

        self.force_irq = (val >> 15) & 1 != 0;

        self.channel_irq_en = ((val >> 16) & 0x7f) as u8;

        self.irq_en = (val >> 23) & 1 != 0;

        // Writing 1 to a flag resets it
        let ack = ((val >> 24) & 0x3f) as u8;
        self.channel_irq_flags &= !ack;
    }
}

Then you'll have to plug those accessor methods in the interconnect as usual27.

DMA Channel Control register

Field bits Description


0 Transfer direction: RAM-to-device(0) or device-to-RAM(1) 1 Address increment(0) or decrement(1) mode 2 Chopping mode $[10:9]$ Synchronization type: Manual(0), Request(1) or Linked List(2) $[18:16]$ Chopping DMA window $[22:20]$ Chopping CPU window 24 Enable 28 Manual trigger $[30:29]$ Unknown

: DMA Channel Control register description

[[tab:dma_channel_control]]{#tab:dma_channel_control label="tab:dma_channel_control"}

The next DMA access is at offset 0x28. This is the control register for channel 2 (the GPU). This register contain many important fields described in table [tab:dma_channel_control]{reference-type="ref" reference="tab:dma_channel_control"}.

Bit 0 sets the transfer direction (RAM-to-device or device-to-RAM), bit 1 tells us whether the DMA must increment or decrement the address in RAM during the transfer.

Bits [10:9] configure the type of synchronization: the DMA either copies all the data at once (Manual sync) or it can wait for the device to raise a "ready" flag to request more data or say that data is available when reading (Request sync). There's a third mode, Linked List sync, which is used with the GPU. We'll explain what it does when we look at ordering tables in a moment.

Bit 24 enables the channel and starts the transfer in Request or Linked List sync mode. Bit 28 is the trigger to start the tranfer in Manual sync mode.

Bit 8 enables "chopping": when active the DMA will periodically stop to let the CPU run for a while. Bits "[18:16]" and "[22:20]" respectively say how often and for how long the control must be given back to the CPU. At this point I'm not entirely sure if chopping only works in Manual sync mode or all the time. It doesn't really matter since we won't implement it for now.

Finally bits [30:29] are read/write but I don't know what they do28.

The current value of '0x401' sets the transfer direction to RAM-to-device and the sync mode to Linked List. It doesn't set bit '24' to enable the channel however so nothing happens29.

Since there are 7 DMA channels I'm going to factor all channel-related code in a Channel structure:

/// Per-channel data
struct Channel {
    enable: bool,
    direction: Direction,
    step: Step,
    sync: Sync,
    /// Used to start the DMA transfer when `sync` is `Manual`
    trigger: bool,
    /// If true the DMA "chops" the transfer and lets the CPU run
    /// in the gaps.
    chop: bool,
    /// Chopping DMA window size (log2 number of words)
    chop_dma_sz: u8,
    /// Chopping CPU window size (log2 number of cycles)
    chop_cpu_sz: u8,
    /// Unkown 2 RW bits in configuration register
    dummy: u8,
}


impl Channel {
    fn new() -> Channel {
        Channel {
            enable: false,
            direction: Direction::ToRam,
            step: Step::Increment,
            sync: Sync::Manual,
            trigger: false,
            chop: false,
            chop_dma_sz: 0,
            chop_cpu_sz: 0,
            dummy: 0,
        }
    }

    pub fn control(&self) -> u32 {
        let mut r = 0;

        r |= (self.direction as u32) << 0;
        r |= (self.step as u32) << 1;
        r |= (self.chop as u32) << 8;
        r |= (self.sync as u32) << 9;
        r |= (self.chop_dma_sz as u32) << 16;
        r |= (self.chop_cpu_sz as u32) << 20;
        r |= (self.enable as u32) << 24;
        r |= (self.trigger as u32) << 28;
        r |= (self.dummy as u32) << 29;

        r
    }

    pub fn set_control(&mut self, val: u32) {
        self.direction = match val & 1 != 0 {
            true  => Direction::FromRam,
            false => Direction::ToRam,
        };

        self.step = match (val >> 1) & 1 != 0 {
            true  => Step::Decrement,
            false => Step::Increment,
        };

        self.chop = (val >> 8) & 1 != 0;

        self.sync = match (val >> 9) & 3 {
            0 => Sync::Manual,
            1 => Sync::Request,
            2 => Sync::LinkedList,
            n => panic!("Unknown DMA sync mode {}", n),
        };

        self.chop_dma_sz = ((val >> 16) & 7) as u8;
        self.chop_cpu_sz = ((val >> 20) & 7) as u8;

        self.enable  = (val >> 24) & 1 != 0;
        self.trigger = (val >> 28) & 1 != 0;

        self.dummy = ((val >> 29) & 3) as u8;
    }
}

/// DMA transfer direction
pub enum Direction {
    ToRam   = 0,
    FromRam = 1,
}

/// DMA transfer step
pub enum Step {
    Increment = 0,
    Decrement = 1,
}

/// DMA transfer synchronization mode
pub enum Sync {
    /// Transfer starts when the CPU writes to the Trigger bit and
    /// transfers everything at once
    Manual = 0,
    /// Sync blocks to DMA requests
    Request = 1,
    /// Used to transfer GPU command lists
    LinkedList = 2,
}

We can then put an array of 7 Channel instances in our struct Dma with some methods to access them in the interconnect:

/// Direct Memory Access
pub struct Dma {
    //...

    /// The 7 channel instances
    channels: [Channel; 7],
}

impl Dma {
    //...

    /// Return a reference to a channel by port number.
    pub fn channel(&self, port: Port) -> &Channel {
        &self.channels[port as usize]
    }

    /// Return a mutable reference to a channel by port number.
    pub fn channel_mut(&mut self, port: Port) -> &mut Channel {
        &mut self.channels[port as usize]
    }
}

/// The 7 DMA ports
pub enum Port {
    /// Macroblock decoder input
    MdecIn = 0,
    /// Macroblock decoder output
    MdecOut = 1,
    /// Graphics Processing Unit
    Gpu = 2,
    /// CD-ROM drive
    CdRom = 3,
    /// Sound Processing Unit
    Spu = 4,
    /// Extension port
    Pio = 5,
    /// Used to clear the ordering table
    Otc = 6,
}

impl Port {
    pub fn from_index(index: u32) -> Port {
        match index {
            0 => Port::MdecIn,
            1 => Port::MdecOut,
            2 => Port::Gpu,
            3 => Port::CdRom,
            4 => Port::Spu,
            5 => Port::Pio,
            6 => Port::Otc,
            n => panic!("Invalid port {}", n),
        }
    }
}

That's quite a lot of code to parse one register but it should make our life easier later on.

Since the 7 channels have the same register layout we can rewrite our Interconnect methods to be a little more generic:

Impl Interconnect {
    //...

    /// DMA register read
    fn dma_reg(&self, offset: u32) -> u32 {
        let major = (offset & 0x70) >> 4;
        let minor = offset & 0xf;

        match major {
            // Per-channel registers
            0...6 => {
                let channel = self.dma.channel(Port::from_index(major));

                match minor {
                    8 => channel.control(),
                    _ => panic!("Unhandled DMA read at {:x}", offset)
                }
            },
            // Common DMA registers
            7 => match minor {
                0 => self.dma.control(),
                4 => self.dma.interrupt(),
                _ => panic!("Unhandled DMA read at {:x}", offset)
            },
            _ => panic!("Unhandled DMA read at {:x}", offset)
        }
    }

    /// DMA register write
    fn set_dma_reg(&mut self, offset: u32, val: u32) {
        let major = (offset & 0x70) >> 4;
        let minor = offset & 0xf;

        match major {
            // Per-channel registers
            0...6 => {
                let port = Port::from_index(major);
                let channel = self.dma.channel_mut(port);

                match minor {
                    8 => channel.set_control(val),
                    _ => panic!("Unhandled DMA write {:x}: {:08x}",
                                offset, val)
                }
            },
            // Common DMA registers
            7 => {
                match minor {
                    0 => self.dma.set_control(val),
                    4 => self.dma.set_interrupt(val),
                    _ => panic!("Unhandled DMA write {:x}: {:08x}",
                                offset, val),
                }
            }
            _ => panic!("Unhandled DMA write {:x}: {:08x}",
                        offset, val),
        };
    }
}

DMA Base Address register

After that the BIOS writes 0x800eb8d4 to DMA register offset 0x60. It means that the BIOS now moved to channel 6 (Clear Ordering Table) and sets the Base Address register. Only the low 24 bits are used since it's plenty enough to address the whole RAM. This one is pretty straightforward: it gives the address of the first word to be read or written in RAM. We can add it to our struct Channel:

/// Per-channel data
struct Channel {
    //...

    /// DMA start address
    base: u32,
}

impl Channel {
    //...

    fn new() -> Channel {
        Channel {
            //...

            base: 0,
        }
    }

    /// Retrieve the channel's base address
    pub fn base(&self) -> u32 {
        self.base
    }

    /// Set channel base address. Only bits [0:23] are significant so
    /// only 16MB are addressable by the DMA
    pub fn set_base(&mut self, val: u32) {
        self.base = val & 0xffffff;
    }
}

DMA Block Control register

After plugging the base address methods in the interconnect we can proceed to our next DMA register access: it's the value 0x00000400 at offset 0x64. This is our last unhandled DMA channel register: the Block Control. Its meaning depends on the synchronization type in the channel Control register (see table [tab:dma_channel_control]{reference-type="ref" reference="tab:dma_channel_control"}:

  • In Manual sync mode only the low 16bits are used and they contain the number of words to transfer.

  • In Request sync mode the low 16 bits contain the block size in words while the upper 16bits contain the number of blocks to transfer. The DMA will transfer a block at a time and wait for the device to assert the "request" flag before starting a new block.

  • In Linked List mode this register is not used.

We can store the contents of this registers in two u16s:

/// Per-channel data
#[derive(Copy)]
struct Channel {
    //...

    /// Size of a block in words
    block_size: u16,
    /// Block count, Only used when `sync` is `Request`
    block_count: u16,
}

impl Channel {
    //...

    fn new() -> Channel {
        Channel {
            block_size: 0,
            block_count: 0,
        }
    }

    /// Retrieve value of the Block Control register
    pub fn block_control(&self) -> u32 {
        let bs = self.block_size as u32;
        let bc = self.block_count as u32;

        (bc << 16) | bs
    }

    /// Set value of the Block Control register
    pub fn set_block_control(&mut self, val: u32) {
        self.block_size = val as u16;
        self.block_count = (val >> 16) as u16;
    }
}

We can see that the BIOS initialized a base address and block size for channel 6, it's no surprise that it then writes 0x11000002 to the channel control register.

The configuration is Manual sync mode, towards the RAM, with decreasing addresses and it sets the enable and trigger bits to start the transfer.

We can now implement the DMA copy itself but before we do so we must understand what this channel does exactly.

Depth Ordering Tables

DMA channel 6 is used to clear an ordering table in RAM. To understand what it means we need a little background on the Playstation graphics pipeline.

The Playstation is an early 3D console and as such its 3D support is a bit spotty. In particular the GPU doesn't handle 3D primitives at all. That might be surprising but as we'll see later the GPU can only draw 2 dimensional primitives like lines, triangles and rectangles in the framebuffer. There's no $Z$ coordinate and therefore no z-buffer or anything like that. If you have two overlapping triangles whichever is drawn last will appear on top of the other.

That means that when a game wants to render a 3D scene it can't just create a vertex buffer with 3D coordinates and have the GPU do the projection by itself since it can only rasterize 2D graphics. Instead the CPU must do the projection and send the draw commands in the right order (that is, from farthest to closest from the point of view of the camera) to the GPU. This way closer objects will appear above more distant ones when they overlap.

In order to do those computations more efficiently the CPU has a coprocessor called the "Geometry Transfor Engine" which can be used to project the primitives and compute their distance to the camera.

All this code needs to be pretty efficient because in 3D games the camera's and objects' positions can change at every frame which means that the position of all primitives must potentially be recomputed every time. And in order to do this more efficiently the Playstation hardware supports a construct called "depth ordering tables".

Let's consider a concrete example: a game wants to draw a cube. In order to do that it needs to render 6 quadrilaterals (or quads for short), one for each side. Let's assume that the player can move around the cube so that the game doesn't know ahead of time which side will be facing the camera.

We've seen that the game has to send the commands in the right order to the GPU otherwise the back side might appear in front for instance. That means that it must sort the primitives (the 6 quads) from back to front before sending the commands to the GPU.

One possibility would be for the game to allocate a buffer big enough to contain all the draw commands for the current scene, fill it with all the projected primitives while sorting them in the correct order. If you want to draw a cube that's probably fine but for a complex scene with thousands of draw commands the CPU load will become huge, it'll spend its time sorting draw commands in RAM.

Fortunately there's an other solution: in order to keep the draw commands ordered while not having to move things around all the time they're stored in a linked list. As you know inserting an entry between two elements in a linked list is very cheap: you just rewrite the element's list pointers and you're done.

So here's how a depth ordering table is implemented: each command is stored in a "packet", somewhere in RAM. A packet starts with a 32bit "header" word. The low 24bits of that word are the address of the next packet in RAM or 0xffffff if it's the last item and the high 8bits are the number of words in the packet.

You start with an empty table: you create an array of empty packets in RAM (only 32bit headers with the high 8bits set to 0 to indicate they're empty) and you make each entry point to the address of the previous one and the last one set to 0xffffff. So you have a linked list of empty elements stored in an array in reverse order. Sounds silly but it's actually very handy.

Now when the CPU wants to render a primitive it computes its distance to the camera, normalizes it over the size of the ordering table and uses it as an index. It can then take the value of the header at location in the table and insert the draw command in the list at that point. This way it doesn't have to iterate through the entire list to figure out where the primitive goes, the ordering table effectively works like a lookup table.

No matter the size of the scene, no matter how many elements have already been inserted in the list you can always insert a new draw command by creating a packet in ram, figuring out the depth index and updating two headers to insert yourself in the right order. The computing cost is constant.

Of course, there can be collisions. Since there are only a finite number of positions in the depth ordering tables two or more packets can end up sharing the same slot. When that happens the newer element will point to the previous one and will therefore be drawn first (regardless of whether it's actually on front or behind). The smaller the table the smaller the granularity. That explains some of the visual glitches you can see in a lot of 3D games on the console, it's just a limitation of the hardware.

Once the game has finished projecting and sorting the scene's draw command it can send it to the GPU by starting from the last entry in the depth ordering table and then iterating through the linked list until it reaches the 0xffffff end-of-list marker.

DMA Clear Ordering Table channel

Enough theory, let's implement DMA channel 6. So far I've encapsulated all DMA-related code in the Dma and Channel structs, unfortunately putting the copy code itself in them is a bit troublesome in Rust. The problem is that this code needs to hold a reference to the various DMA-capable peripherals (RAM, GPU, SPU, etc...) but Rust adds a lot of constraints on references (and especially mutable references) to make sure the code is completely memory-safe.

There are ways to work around that (using RefCells, unsafe code etc...) but I don't want to bother with any of this so I'm just going to implement the copy code directly in the Interconnect since it already has access to all the peripherals.

First, in the set_dma_reg function I'm going to check if a write to a DMA channel register activated it:

impl Interconnect {
//...

    /// DMA register write
    fn set_dma_reg(&mut self, offset: u32, val: u32) {
        let major = (offset & 0x70) >> 4;
        let minor = offset & 0xf;

        let active_port =
            match major {
                // Per-channel registers
                0...6 => {
                    let port = Port::from_index(major);
                    let channel = self.dma.channel_mut(port);

                    match minor {
                        0 => channel.set_base(val),
                        4 => channel.set_block_control(val),
                        8 => channel.set_control(val),
                        _ =>
                        panic!("Unhandled DMA write {:x}: {:08x}",
                               offset, val)
                    }

                    if channel.active() {
                        Some(port)
                    } else {
                        None
                    }
                },
                // Common DMA registers
                7 => {
                    //...

                    None
                }
                _ => panic!("Unhandled DMA write {:x}: {:08x}",
                            offset, val),
            };

        if let Some(port) = active_port {
            self.do_dma(port);
        }
    }
}

impl Channel {
    //...

    /// Return true if the channel has been started
    pub fn active(&self) -> bool {
        // In manual sync mode the CPU must set the "trigger" bit
        // to start the transfer.
        let trigger = match self.sync {
            Sync::Manual => self.trigger,
            _            => true,
        };

        self.enable && trigger
    }
}

Now the Interconnect's do_dma method will be called when a transfer must take place.

The Manual and Request modes both copy blocks of data from/to the RAM. Linked List mode is a bit different since it hops around the RAM following the pointers in the headers. For this reason making a generic function to handle all three modes will be a bit tricky, I prefer to handle linked list separately:

impl Interconnect {
    //...

    /// Execute DMA transfer for a port
    fn do_dma(&mut self, port: Port) {
        // DMA transfer has been started, for now let's
        // process everything in one pass (i.e. no
        // chopping or priority handling)

        match self.dma.channel(port).sync() {
            Sync::LinkedList => panic!("Linked list mode unsupported"),
            _                => self.do_dma_block(port),
        }
    }
}

DMA Block copy

We can now implement the block copy function itself. We start at the base address, we compute how many words we must copy by looking at the block control values. Then we enter the copy loop: depending on the copy direction we either read a word from RAM and send it to the device or the other way around.

Since channel 6 is only used to initialize an ordering table we only need to implement the 'ToRam' direction for now. Also the value copied into RAM doesn't come from an external peripheral, it's just generated by the DMA based on the current address:

impl Interconnect {
    //...

    fn do_dma_block(&mut self, port: Port) {
        let channel = self.dma.channel_mut(port);

        let increment = match channel.step() {
            Step::Increment =>  4,
            Step::Decrement => -4,
        };

        let mut addr = channel.base();

        // Transfer size in words
        let mut remsz = match channel.transfer_size() {
            Some(n) => n,
            // Shouldn't happen since we shouldn't be reaching this
            // code in linked list mode
            None    =>
              panic!("Couldn't figure out DMA block transfer size"),
        };

        while remsz > 0 {
            // Not sure what happens if address is
            // bogus... Mednafen just masks addr this way, maybe
            // that's how the hardware behaves (i.e. the RAM
            // address wraps and the two LSB are ignored, seems
            // reasonable enough
            let cur_addr = addr & 0x1ffffc;

            match channel.direction() {
                Direction::FromRam => panic!("Unhandled DMA direction"),
                Direction::ToRam => {
                    let src_word = match port {
                        // Clear ordering table
                        Port::Otc => match remsz {
                            // Last entry contains the end
                            // of table marker
                            1 => 0xffffff,
                            // Pointer to the previous entry
                            _ => addr.wrapping_sub(4) & 0x1fffff,
                        },
                        _ => panic!("Unhandled DMA source port {}",
                                    port as u8),
                    };

                    self.ram.store32(cur_addr, src_word);
                }
            }

            addr = addr.wrapping_add(increment);
            remsz -= 1;
        }

        channel.done();
    }
}

impl Channel {
    //...

    pub fn direction(&self) -> Direction {
        self.direction
    }

    pub fn step(&self) -> Step {
        self.step
    }

    pub fn sync(&self) -> Sync {
        self.sync
    }

    /// Return the DMA transfer size in bytes or None for linked
    /// list mode.
    pub fn transfer_size(&self) -> Option<u32> {
        let bs = self.block_size as u32;
        let bc = self.block_count as u32;

        match self.sync {
            // For manual mode only the block size is used
            Sync::Manual => Some(bs),
            // In DMA request mode we must transfer `bc` blocks
            Sync::Request => Some(bc * bs),
            // In linked list mode the size is not known ahead of
            // time: we stop when we encounter the "end of list"
            // marker (0xffffff)
            Sync::LinkedList => None,
        }
    }

    /// Set the channel status to "completed" state
    pub fn done(&mut self) {
        self.enable = false;
        self.trigger = false;

        // XXX Need to set the correct value for the other fields
        // (in particular interrupts)
    }
}

Note the conditional to write 0xffffff in the last iteration, it's of course important because otherwise the DMA won't find the end of table marker and start jumping randomly in RAM, sending crap to the GPU in the process. It's important to note that this is vital, and even the "Sony Logo" won't render correctly if this is not implemented, if you're not receiving GP0(38h), this is probably why.

When the copy is done I call the channel.done() method which clears the trigger and enable flags. It should probably do more than that eventually, in particular it should trigger the interrupt if it's enabled. We'll leave that for later.

We can now finally run our first DMA transfer in full! The BIOS sets the base address to 0x000eb8d4 and the block size to 1024 before starting channel 6 and we then initialize an empty ordering table.

After that the BIOS enters an infinite loop on the GPUSTAT register. This time it's waiting for bit 26 which is "ready to receive command word". We are going to set this bit by default and while we're at it we're also going to add bit 27 which is "ready to send VRAM to CPU". This way we should avoid locking the BIOS on this register in the future:

impl Interconnect {
    //...

    /// Load 32bit word at `addr`
    pub fn load32(&self, addr: u32) -> u32 {
        //...

        if let Some(offset) = map::GPU.contains(abs_addr) {
            return match offset {
                // GPUSTAT: set bit 26, 27 28 to signal that the GPU
                // is ready for DMA and CPU access. This way the BIOS
                // won't dead lock waiting for an event that'll never
                // come.
                4 => 0x1c000000,
                _ => 0,
            }
        }
    }
}

With this modification the BIOS goes a little further and configures DMA channel 2 to send a Linked List to the GPU.

DMA Linked Lists

Navigating the linked list is pretty straightforward: the BIOS puts the address of the first list header is the DMA channel's base address. We read the high byte of the header to know the size of the packet (in words, not counting the header). Packets are continuous in RAM so the data follows the header word directly.

Once the packet data has been sent to the device we look at the low 24bits of the header. If it's 0xffffff then we're done, otherwise it contains the address of the next header and we loop.

I'm not sure about if linked list mode is supported only by channel 2 (the GPU) or if it's available for other ports. As far as I can tell it's only ever used to send commands to the GPU however, I'll have to remember test that.

By the way, interesting bit of information for us emulator writers: it seems that while the DMA offers a great deal of flexibility with a lot options and flags only a handful of configs are ever used for each channel. PCSX-R hardcodes those configs and simply ignores more exotic flag combinations (even though they're technically possible) and mednafen, while supporting most options, has an optimized fast path for the common configs. The Nocash's docs also lists those common configs (and the few odd variations in some games). It means that we can probably go a long way even if we don't support some obscure configurations.

Here's what my simple linked list synchronization mode implementation looks like:

impl Interconnect {
    //...

    /// Execute DMA transfer for a port
    fn do_dma(&mut self, port: Port) {
        // DMA transfer has been started, for now let's
        // process everything in one pass (i.e. no
        // chopping or priority handling)

        match self.dma.channel(port).sync() {
                Sync::LinkedList => self.do_dma_linked_list(port),
                _                => self.do_dma_block(port),
        }
    }

    /// Emulate DMA transfer for linked list synchronization mode.
    fn do_dma_linked_list(&mut self, port: Port) {
        let channel = self.dma.channel_mut(port);

        let mut addr = channel.base() & 0x1ffffc;

        if channel.direction() == Direction::ToRam {
            panic!("Invalid DMA direction for linked list mode");
        }

        // I don't know if the DMA even supports linked list mode
        // for anything besides the GPU
        if port != Port::Gpu {
            panic!("Attempted linked list DMA on port {}",
                   port as u8);
        }

        loop {
            // In linked list mode, each entry starts with a
            // "header" word. The high byte contains the number
            //  of words in the "packet" (not counting the header
            //  word)
            let header = self.ram.load32(addr);

            let mut remsz = header >> 24;

            while remsz > 0 {
                addr = (addr + 4) & 0x1ffffc;

                let command = self.ram.load32(addr);

                println!("GPU command {:08x}", command);

                remsz -= 1;
            }

            // The end-of-table marker is usually 0xffffff but
            // mednafen only checks for the MSB so maybe that's what
            // the hardware does? Since this bit is not part of any
            // valid address it makes some sense. I'll have to test
            // that at some point...
            if header & 0x800000 != 0 {
                break;
            }

            addr = header & 0x1ffffc;
        }

        channel.done();
    }
}

Since we haven't implement the GPU yet I just display the command word without further processing. We'll have to hook our GPU rendering code here when it's done. Let's get a bit further in our DMA implementation before we start working on the GPU, don't have anything interesting to display yet.

RAM to device GPU block copy

After this the BIOS wants to do an other DMA transfer from the RAM towards the GPU but this time in Request synchronization mode. It probably wants to load a texture. Adding support for this in our do_dma_block function is quite trivial:

impl Interconnect {
    //...

    /// Emulate DMA transfer for Manual and Request synchronization
    /// modes.
    fn do_dma_block(&mut self, port: Port) {
        //...

        while remsz > 0 {
            //...

            match channel.direction() {
                Direction::FromRam => {
                    let src_word = self.ram.load32(cur_addr);

                    match port {
                        Port::Gpu => println!("GPU data {:08x}", src_word),
                        _ => panic!("Unhandled DMA destination port {}",
                                    port as u8),
                    }
                }
        //...
            }

            addr = addr.wrapping_add(increment);
            remsz -= 1;
        }

        channel.done();
    }
}

We still can't do much more than printing the raw GPU data but at least the DMA part seems to work as intended. If we try to interpret the GPU commands sent through the linked list we can guess what it's doing30:

  • First it displays a black quadrilateral that takes the whole screen (command 0x28000000). It does this several times.

  • Then it appears to load a texture (maybe the background with the text?)

  • Then it draws the same quadrilateral again but with a dark-grey color (command 0x28030303 where 0x030303 is a 24bit BGR colour)

  • Then it draws it again repetedly, slowly changing the colour to a lighter grey, it looks like the "fade-in" effect at the very beginning of the boot animation. (commands 0x28060606, 0x28090909 etc... to 0x28b4b4b4)

  • Then it adds three more draw commands: 0x380000b2 which draws a shaded quadrilateral and two 0x300000b2 commands which draw shaded triangles.

Then a little while after that we stumble upon an unhandled store8 at address 0x1f801800 which is a CDROM drive register. I'm pleased we managed to get to that point with our bare bones emulator. We don't even support interrupts!

But before we look at this CDROM business it's tempting to try to implement a basic GPU and display our first frames. After all it seems that our emulator manages to go through the entire first boot logo. It'll be more rewarding to see that than the hexadecimal debug dumps we've become accustomed to and it'll validate that our CPU is working correctly.

The GPU: Internal state and first commands

We're finally getting to the fun part: drawing on the screen. The objective of this part in twofold:

  • We want to create a reasonably accurate internal representation of the PSX GPU. Mainly we want to update the register values to reflect the current GPU state instead of our current hardcoded values. This will layout a basic GPU state machine that we'll improve later when we'll implement video timings, interrupts and other delicacies.

  • We'll also implement a very simple and innacurate OpenGL renderer. That'll give us the opportunity to implement some of the very boring low level OpenGL boilerplate and we'll have some visual feedback for debugging the rest of the emulator.

In order to do this we'll start back from the beginning, review all the GPU register accesses (both from the CPU and DMA) and attempt to implement them as best as we can.

GPUSTAT register

The GPU only has a single status register but it's packed full of miscelaneous information about the GPU state. It contains fields describing the texture mapping config, the video mode, the various "ready" bits for the command FIFOs, the color mode etc...

We can start by declaring the various variables holding all that state. In that end I'm going to create a whole bunch of new types in order to manipulate nice type-safe symbolic values instead of meaningless integers:

pub struct Gpu {
    /// Texture page base X coordinate (4 bits, 64 byte increment)
    page_base_x: u8,
    /// Texture page base Y coordinate (1bit, 256 line increment)
    page_base_y: u8,
    /// Semi-transparency. Not entirely sure how to handle that value
    /// yet, it seems to describe how to blend the source and
    /// destination colors.
    semi_transparency: u8,
    /// Texture page color depth
    texture_depth: TextureDepth,
    /// Enable dithering from 24 to 15bits RGB
    dithering: bool,
    /// Allow drawing to the display area
    draw_to_display: bool,
    /// Force "mask" bit of the pixel to 1 when writing to VRAM
    /// (otherwise don't modify it)
    force_set_mask_bit: bool,
    /// Don't draw to pixels which have the "mask" bit set
    preserve_masked_pixels: bool,
    /// Currently displayed field. For progressive output this is
    /// always Top.
    field: Field,
    /// When true all textures are disabled
    texture_disable: bool,
    /// Video output horizontal resolution
    hres: HorizontalRes,
    /// Video output vertical resolution
    vres: VerticalRes,
    /// Video mode
    vmode: VMode,
    /// Display depth. The GPU itself always draws 15bit RGB, 24bit
    /// output must use external assets (pre-rendered textures, MDEC,
    /// etc...)
    display_depth: DisplayDepth,
    /// Output interlaced video signal instead of progressive
    interlaced: bool,
    /// Disable the display
    display_disabled: bool,
    /// True when the interrupt is active
    interrupt: bool,
    /// DMA request direction
    dma_direction: DmaDirection,
}

/// Depth of the pixel values in a texture page
#[derive(Copy)]
enum TextureDepth {
    /// 4 bits per pixel
    T4Bit = 0,
    /// 8 bits per pixel
    T8Bit = 1,
    /// 15 bits per pixel
    T15Bit = 2,
}

/// Interlaced output splits each frame in two fields
#[derive(Copy)]
enum Field {
    /// Top field (odd lines).
    Top = 1,
    /// Bottom field (even lines)
    Bottom = 0,
}

/// Video output horizontal resolution
#[derive(Copy)]
struct HorizontalRes(u8);

impl HorizontalRes {
    /// Create a new HorizontalRes instance from the 2 bit field `hr1`
    /// and the one bit field `hr2`
    fn from_fields(hr1: u8, hr2: u8) -> HorizontalRes {
        let hr = (hr2 & 1) | ((hr1 & 3) << 1);

        HorizontalRes(hr)
    }

    /// Retrieve value of bits [18:16] of the status register
    fn into_status(self) -> u32 {
        let HorizontalRes(hr) = self;

        (hr as u32) << 16
    }
}

/// Video output vertical resolution
#[derive(Copy)]
enum VerticalRes {
    /// 240 lines
    Y240Lines = 0,
    /// 480 lines (only available for interlaced output)
    Y480Lines = 1,
}

/// Video Modes
#[derive(Copy)]
enum VMode {
    /// NTSC: 480i60H
    Ntsc = 0,
    /// PAL: 576i50Hz
    Pal  = 1,
}

/// Display area color depth
#[derive(Copy)]
enum DisplayDepth {
    /// 15 bits per pixel
    D15Bits = 0,
    /// 24 bits per pixel
    D24Bits = 1,
}

/// Requested DMA direction.
#[derive(Copy)]
enum DmaDirection {
    Off = 0,
    Fifo = 1,
    CpuToGp0 = 2,
    VRamToCpu = 3,
}

This is basically a direct translation of the GPUSTAT register fields. I must say that at that point I don't fully understand all of those variables and it's possible that we'll have to change this implementation or maybe simply rename some of them. It does however give us a foretaste of the various features/quirks of the Playstation GPU that we'll have to implement eventually if we want to make an accurate renderer. If some of those variables mean nothing to you don't worry, we'll review them all when we actually need them.

I'm not entirely sure what's the GPU state at reset but I think the BIOS will reconfigure everything anyway. Let's assume that all the values are 0 on reset, except for the display_disabled field:

impl Gpu {
    pub fn new() -> Gpu {
        Gpu {
            page_base_x: 0,
            page_base_y: 0,
            semi_transparency: 0,
            texture_depth: TextureDepth::T4Bit,
            dithering: false,
            draw_to_display: false,
            force_set_mask_bit: false,
            preserve_masked_pixels: false,
            field: Field::Top,
            texture_disable: false,
            hres: HorizontalRes::from_fields(0, 0),
            vres: VerticalRes::Y240Lines,
            vmode: VMode::Ntsc,
            display_depth: DisplayDepth::D15Bits,
            interlaced: false,
            display_disabled: true,
            interrupt: false,
            dma_direction: DmaDirection::Off,
        }
    }
}

For the time being we can implement the GPUSTAT register read. It's a read-only register since writes to the GPUSTAT register address end up in the GP1 register. We'll see how the GPU config is modified in a minute.

impl Gpu {
    //...

    /// Retrieve value of the status register
    pub fn status(&self) -> u32 {
        let mut r = 0u32;

        r |= (self.page_base_x as u32) << 0;
        r |= (self.page_base_y as u32) << 4;
        r |= (self.semi_transparency as u32) << 5;
        r |= (self.texture_depth as u32) << 7;
        r |= (self.dithering as u32) << 9;
        r |= (self.draw_to_display as u32) << 10;
        r |= (self.force_set_mask_bit as u32) << 11;
        r |= (self.preserve_masked_pixels as u32) << 12;
        r |= (self.field as u32) << 13;
        // Bit 14: not supported
        r |= (self.texture_disable as u32) << 15;
        r |= self.hres.into_status();
        r |= (self.vres as u32) << 19;
        r |= (self.vmode as u32) << 20;
        r |= (self.display_depth as u32) << 21;
        r |= (self.interlaced as u32) << 22;
        r |= (self.display_disabled as u32) << 23;
        r |= (self.interrupt as u32) << 24;

        // For now we pretend that the GPU is always ready:
        // Ready to receive command
        r |= 1 << 26;
        // Ready to send VRAM to CPU
        r |= 1 << 27;
        // Ready to receive DMA block
        r |= 1 << 28;

        r |= (self.dma_direction as u32) << 29;

        // Bit 31 should change depending on the currently drawn
        // line (whether it's even, odd or in the vblack
        // apparently). Let's not bother with it for now.
        r |= 0 << 31;

        // Not sure about that, I'm guessing that it's the signal
        // checked by the DMA in when sending data in Request
        // synchronization mode. For now I blindly follow the
        // Nocash spec.
        let dma_request =
            match self.dma_direction {
                // Always 0
                DmaDirection::Off => 0,
                // Should be 0 if FIFO is full, 1 otherwise
                DmaDirection::Fifo => 1,
                // Should be the same as status bit 28
                DmaDirection::CpuToGp0 => (r >> 28) & 1,
                // Should be the same as status bit 27
                DmaDirection::VRamToCpu => (r >> 27) & 1,
            };

        r |= dma_request << 25;

        r
    }
}

You can see that I don't support bit 14: the Nocash spec says that when this bit is set on the real hardware just messes up the display in a weird way. We can probably assume that it's not a commonly used feature for the moment.

As before I hardcode the "ready" bits to 1 since we have a long way to go before we have the necessary infrastructure to emulate them accurately. We'll need to emulate the various internal FIFOs and the rate at which they empty for instance. That will come later.

In general I'm not entirely sure how the DMA state machine synchronizes with the GPU. We'll have to hope it's not too critical for now. As we progress if we start to notice that our emulator seems to misbehave because of a broken GPU DMA we'll have to investigate further.

GP0 Dram Mode Setting command

All the GPU configuration and draw commands are transferred through two registers: GP0 and GP1. GP0 is used to send drawing commands (lines, triangles, quadrilaterals with various attributes) and to copy data between the VRAM (the video RAM dedicated to the GPU) and the CPU/DMA.

We'll have to decode those command like we decoded the CPU instructions. The format is pretty simple: the most significant byte is the "opcode" and the rest are parameters whose meaning depends on the opcode. The only difficulty is that GP0 commands can take a variable amount of parameters and therefore fit in multiple words.

The first command sent by the BIOS into GP0 is 0xe1003000. The high byte is 0xe1 which is the "Draw Mode setting" command. It sets a bunch of texture-related values (dithering, texture depth, texture disable, etc...) and two new fields we haven't already encountered in the GPUSTAT register: rectangle_texture_x_flip and rectangle_texture_y_flip. They're used to mirror a textured rectangle horizontally or vertically:

pub struct Gpu {
    //...

    /// Mirror textured rectangles along the x axis
    rectangle_texture_x_flip: bool,
    /// Mirror textured rectangles along the y axis
    rectangle_texture_y_flip: bool,
}

impl Gpu {
    //...

    /// Handle writes to the GP0 command register
    pub fn gp0(&mut self, val: u32) {
        let opcode = (val >> 24) & 0xff;

        match opcode {
            0xe1 => self.gp0_draw_mode(val),
            _    => panic!("Unhandled GP0 command {:08x}", val),
        }
    }

    /// GP0(0xE1) command
    fn gp0_draw_mode(&mut self, val: u32) {
        self.page_base_x = (val & 0xf) as u8;
        self.page_base_y = ((val >> 4) & 1) as u8;
        self.semi_transparency = ((val >> 5) & 3) as u8;

        self.texture_depth =
            match (val >> 7) & 3 {
                0 => TextureDepth::T4Bit,
                1 => TextureDepth::T8Bit,
                2 => TextureDepth::T15Bit,
                n => panic!("Unhandled texture depth {}", n),
            };

        self.dithering = ((val >> 9) & 1) != 0;
        self.draw_to_display = ((val >> 10) & 1) != 0;
        self.texture_disable = ((val >> 11) & 1) != 0;
        self.rectangle_texture_x_flip = ((val >> 12) & 1) != 0;
        self.rectangle_texture_y_flip = ((val >> 13) & 1) != 0;
    }
}

We can now call our new gp0 method from the interconnect:

impl Interconnect {
    //...

    /// Store 32bit word `val` into `addr`
    pub fn store32(&mut self, addr: u32, val: u32) {
        //...

        if let Some(offset) = map::GPU.contains(abs_addr) {
            match offset {
                0 => self.gpu.gp0(val),
                _ => panic!("GPU write {}: {:08x}", offset, val),
            }
            return;
        }

        //...
    }
}

GP0 NOP command

The next GP0 command sent by the BIOS is '0x0007920c'. Apparently opcode '0x00' is a NOP so I'm not sure what's the meaning of the '0x7920c' given as parameter31. Maybe it's just a garbage value. Let's ignore it for now and implement the NOP:

impl Gpu {
    //...

    /// Handle writes to the GP0 command register
    pub fn gp0(&mut self, val: u32) {
        let opcode = (val >> 24) & 0xff;

        match opcode {
            0x00 => (), // NOP
            0xe1 => self.gp0_draw_mode(val),
            _    => panic!("Unhandled GP0 opcode {:08x}", val),
        }
    }
}

GP1 Soft Reset command

After that the BIOS writes 0x00000000 to GP1 this time. The command format is the same: the high byte is the opcode while the low 24bits contain the parameters. However GP1 has a different set of commands mostly used to configure the display and the DMA. GP1 commands are always one word in length.

GP1 opcode 0x00 is a software reset command, it resets the GPU to a default configuration. It reconfigures most of the fields we've already encountered and a few more:

pub struct Gpu {
    //...

    /// Texture window x mask (8 pixel steps)
    texture_window_x_mask: u8,
    /// Texture window y mask (8 pixel steps)
    texture_window_y_mask: u8,
    /// Texture window x offset (8 pixel steps)
    texture_window_x_offset: u8,
    /// Texture window y offset (8 pixel steps)
    texture_window_y_offset: u8,
    /// Left-most column of drawing area
    drawing_area_left: u16,
    /// Top-most line of drawing area
    drawing_area_top: u16,
    /// Right-most column of drawing area
    drawing_area_right: u16,
    /// Bottom-most line of drawing area
    drawing_area_bottom: u16,
    /// Horizontal drawing offset applied to all vertex
    drawing_x_offset: i16,
    /// Vertical drawing offset applied to all vertex
    drawing_y_offset: i16,
    /// First column of the display area in VRAM
    display_vram_x_start: u16,
    /// First line of the display area in VRAM
    display_vram_y_start: u16,
    /// Display output horizontal start relative to HSYNC
    display_horiz_start: u16,
    /// Display output horizontal end relative to HSYNC
    display_horiz_end: u16,
    /// Display output first line relative to VSYNC
    display_line_start: u16,
    /// Display output last line relative to VSYNC
    display_line_end: u16,
}

I tried to get the reset value from my console, unfortunately some of the values like display_horiz_ and display_line_ cannot be read directly from any register as far as I can tell so I'm going to use the values given by the NoCash specs instead.

impl Gpu {
    //...

    /// Handle writes to the GP1 command register
    pub fn gp1(&mut self, val: u32) {
        let opcode = (val >> 24) & 0xff;

        match opcode {
            0x00 => self.gp1_reset(val),
            _    => panic!("Unhandled GP1 command {:08x}", val),
        }
    }

    /// GP1(0x00): soft reset
    fn gp1_reset(&mut self, _: u32) {
        self.interrupt = false;

        self.page_base_x = 0;
        self.page_base_y = 0;
        self.semi_transparency = 0;
        self.texture_depth = TextureDepth::T4Bit;
        self.texture_window_x_mask = 0;
        self.texture_window_y_mask = 0;
        self.texture_window_x_offset = 0;
        self.texture_window_y_offset = 0;
        self.dithering = false;
        self.draw_to_display = false;
        self.texture_disable = false;
        self.rectangle_texture_x_flip = false;
        self.rectangle_texture_y_flip = false;
        self.drawing_area_left = 0;
        self.drawing_area_top = 0;
        self.drawing_area_right = 0;
        self.drawing_area_bottom = 0;
        self.drawing_x_offset = 0;
        self.drawing_y_offset = 0;
        self.force_set_mask_bit = false;
        self.preserve_masked_pixels = false;

        self.dma_direction = DmaDirection::Off;

        self.display_disabled = true;
        self.display_vram_x_start = 0;
        self.display_vram_y_start = 0;
        self.hres = HorizontalRes::from_fields(0, 0);
        self.vres = VerticalRes::Y240Lines;

        self.vmode = VMode::Ntsc;
        self.interlaced = true;
        self.display_horiz_start = 0x200;
        self.display_horiz_end = 0xc00;
        self.display_line_start = 0x10;
        self.display_line_end = 0x100;
        self.display_depth = DisplayDepth::D15Bits;

        // XXX should also clear the command FIFO when we implement it
        // XXX should also invalidate GPU cache if we ever implement it
    }
}

The reset command is supposed to flush the command FIFO and the texture cache but we don't emulate those yet so I just added a note to remember to modify the function when we add support for one of those.

The texture_window_ parameters are used to crop a texture. The drawing_area_ parameters are used to describe a drawing window, the GPU won't draw anything outside of this area.

The drawing_offset_ parameters are a constant offset that's added to all the vertex. It lets you translate a scene in VRAM without having to recompute all the coordinates on the CPU.

The display_vram_, display_horiz_ and display_line_ parameters are used to describe which portion of the VRAM are drawn on the screen. If you're not familiar with the wonderful world of analog video it might not be immediately obvious what those parameters do so let me give a quick overview of the GPU's video output.

The GPU renderer and the video output

You can think of the Playstation GPU as two different modules operating asynchronously. First you have t he renderer which take the draw commands (through GP0) and rasterizes them into the dedicated video memory used by the GPU: the VRAM.

The VRAM is organized as a two dimensional byte array whose dimensions are 2048x512, giving a grand total of 1MB of video memory. This VRAM is used to store the image generated by the GPU's rasterizer (i.e. the framebuffer) but also any texture used to render the scene. The GPU has no direct access to the main RAM, much less the CDROM: all the assets have to be copied in VRAM by the CPU or DMA before the rendering can take place.

Once the renderer has completed a scene it ends up somewhere in the VRAM. Now it has to be displayed on the TV screen. That's where the GPU's video output is used.

The video output (when enabled) sends the video signal continuously at 60 NTSC or 50 PAL frames per second. It never stops because doing so would cause a glitch on the screen. Consider the CRT displays everybody used in the nineties: you have an electron beam sweeping the screen line by line, you can't jump to any random position of the screen when you want. Even on modern LCD screens most video interfaces (VGA, DVI, HDMI, LVDS, MIPI,...) behave in the same way.

That means that when the game wants to draw a triangle on the screen it doesn't directly send the triangle to the TV, rather it renders it in the framebuffer and the video output will send it to the screen during its next pass.

For the time being we won't bother emulating the video output, we can directly display the contents of the framebuffer. It's not accurate but it's simpler and we should be able to plug our video output emulation layer on top of it when we're ready to implement it.

GPUREAD register placeholder

After those commands the BIOS reads from the register at offset 0 in the GPU (the same address where GP0 commands are written). This register is GPUREAD and is used to retrieve data generated by certain commands, typically to read parts of the framebuffer back in RAM. The problem is that so far no such command has been issued so I'm not sure why the BIOS attempts to read from there. For now let's return 0 and we'll implement it properly later:

impl Gpu {
    //...

    /// Retrieve value of the "read" register
    pub fn read(&self) -> u32 {
        // Not implemented for now...
        0
    }
}

GP1 Display Mode command

The BIOS goes on by sending command 0x08000000 in GP1. Opcode 0x08 sets the display mode: video mode, screen resolution, interlacing etc... It also sets that weird field which we encountered as bit 14 of GPUSTAT: the one who appears to mess up the video output. I'm going to assume this field is useless so I'm just going to crash if it's set, this way if one game relies on it we're sure to catch it:

impl Gpu {
    //...

    /// GP1(0x80): Display Mode
    fn gp1_display_mode(&mut self, val: u32) {
        let hr1 = (val & 3) as u8;
        let hr2 = ((val >> 6) & 1) as u8;

        self.hres = HorizontalRes::from_fields(hr1, hr2);

        self.vres = match val & 0x4 != 0 {
            false => VerticalRes::Y240Lines,
            true  => VerticalRes::Y480Lines,
        };

        self.vmode = match val & 0x8 != 0 {
            false => VMode::Ntsc,
            true  => VMode::Pal,
        };

        self.display_depth = match val & 0x10 != 0 {
            false => DisplayDepth::D24Bits,
            true  => DisplayDepth::D15Bits,
        };

        self.interlaced = val & 0x20 != 0;

        if val & 0x80 != 0 {
            panic!("Unsupported display mode {:08x}", val);
        }
    }
}

GP1 DMA direction command

After that the BIOS issues the GP1 command 0x04000000. Opcode 0x04 simply sets the DMA direction (to Off in this case):

impl Gpu {
    //...

    /// GP1(0x04): DMA Direction
    fn gp1_dma_direction(&mut self, val: u32) {
        self.dma_direction =
            match val & 3 {
                0 => DmaDirection::Off,
                1 => DmaDirection::Fifo,
                2 => DmaDirection::CpuToGp0,
                3 => DmaDirection::VRamToCpu,
                _ => unreachable!(),
            };
    }
}

DMA GP0 commands

After that the CPU issues an other "DMA Direction" command to set it to value 2 (CpuToGp0). After that the BIOS starts sending the Linked List commands using the DMA. Those commands are always sent to GP0 so we can update our linked list DMA routine to call the gp0 method of our GPU:

impl Gpu {
    //...

    /// Emulate DMA transfer for linked list synchronization mode.
    fn do_dma_linked_list(&mut self, port: Port) {
        //...

        loop {
            //...

            while remsz > 0 {
                addr = (addr + 4) & 0x1ffffc;

                let command = self.ram.load32(addr);

                // Send command to the GPU
                self.gpu.gp0(command);

                remsz -= 1;
            }
        //...
        }
        //...
    }
}

GP0 Set Drawing Area commands

The first command sent through the linked list is 0xe3000400 which sets the top-left corner of the drawing area. When the GPU renderer draws to the framebuffer it won't write anything outside of the drawing area even if a draw command clips outside.

impl Gpu {
    //...

    /// GP0(0xE3): Set Drawing Area top left
    fn gp0_drawing_area_top_left(&mut self, val: u32) {
        self.drawing_area_top = ((val >> 10) & 0x3ff) as u16;
        self.drawing_area_left = (val & 0x3ff) as u16;
    }
}

You see that the drawing_area_top value can range from 0 to 1023. It's strange because the GPU VRAM only has 512 lines so anything beyond that value won't be rendered. The horizontal coordinate, drawing_area_left, has the same resolution but this one is normal: the VRAM has 2048 bytes per lines but since the GPU draws 16 bits per pixel (15bit RGB + mask bit) you can only fit 1024 pixels per VRAM line.

Unsuprisingly the next command is 0xe403c27f which sets the bottom-right corner of the drawing area, the parameter packing is the same:

impl Gpu {
    //...

    /// GP0(0xE4): Set Drawing Area bottom right
    fn gp0_drawing_area_bottom_right(&mut self, val: u32) {
        self.drawing_area_bottom = ((val >> 10) & 0x3ff) as u16;
        self.drawing_area_right = (val & 0x3ff) as u16;
    }
}

After those two commands the top-left corder is at [0, 1] while the bottom-right is at [639, 240]. The coordinates are inclusive so the drawing area resolution is 640x240 which looks like a standard NTSC field resolution.

GP0 Set Drawing Offset command

The BIOS continues setting the drawing area with command 0xe5000800 which sets the drawing offset. We have to be careful with that one because the x and y parameters are 11 bit signed two's complement values. It means that the GPU can handle negative offsets. We need to mess with a few bit shifts to get the correct sign extension for negative values32:

impl Gpu {
    //...

    /// GP0(0xE5): Set Drawing Offset
    fn gp0_drawing_offset(&mut self, val: u32) {
        let x = (val & 0x7ff) as u16;
        let y = ((val >> 11) & 0x7ff) as u16;

        // Values are 11bit two's complement signed values, we need to
        // shift the value to 16bits to force sign extension
        self.drawing_x_offset = ((x << 5) as i16) >> 5;
        self.drawing_y_offset = ((y << 5) as i16) >> 5;
    }
}

This particular command sets the offset to [0, 1] which matches the drawing area top-left corner so everything is coherent so far. I'm not sure why the BIOS doesn't start at [0, 0] but I guess wasting one line doesn't matter much for displaying the boot logo.

GP0 Texture Window command

After that we have yet another GPU config command: 0xe2000000 which configures the texture window parameters:

impl Gpu {
    //...

    /// GP0(0xE2): Set Texture Window
    fn gp0_texture_window(&mut self, val: u32) {
        self.texture_window_x_mask = (val & 0x1f) as u8;
        self.texture_window_y_mask = ((val >> 5) & 0x1f) as u8;
        self.texture_window_x_offset = ((val >> 10) & 0x1f) as u8;
        self.texture_window_y_offset = ((val >> 15) & 0x1f) as u8;
    }
}

GP0 Mask Bit Setting command

The BIOS continues with the last GP0 rendering attribute command: e6000000' which sets the mask bit-related parameters:

impl Gpu {
    //...

    /// GP0(0xE6): Set Mask Bit Setting
    fn gp0_mask_bit_setting(&mut self, val: u32) {
        self.force_set_mask_bit = (val & 1) != 0;
        self.preserve_masked_pixels = (val & 2) != 0;
    }
}

The mask bit behaves a bit like OpenGL's stencil masks, it prevents the GPU from overwriting a pixel if its mask bit is set and masking is enabled.

GP1 Display VRAM Start command

The BIOS then configures the video output through the GP1 register. It starts with 0x0503c400 which sets the display start address in VRAM.

Note that the LSB of the horizontal coordinate is ignored. It means that we're always aligned to a 16bit pixel.

impl Gpu {
    //...

    /// GP1(0x05): Display VRAM Start
    fn gp1_display_vram_start(&mut self, val: u32) {
        self.display_vram_x_start = (val & 0x3fe) as u16;
        self.display_vram_y_start = ((val >> 10) & 0x1ff) as u16;
    }
}

The current command sets the start coordinates to [0, 241] which is immediately below the drawing area we configured before. I assume it's because the BIOS will use a form of double buffering and won't draw directly to the displayed area.

GP1 Display Range commands

After the display VRAM start address the BIOS configures the video output timings with commands 0x06c60260 and 0x0703fc10 which respectively set the display's horizontal and vertical range33:

impl Gpu {
    //...

    /// GP1(0x06): Display Horizontal Range
    fn gp1_display_horizontal_range(&mut self, val: u32) {
        self.display_horiz_start = (val & 0xfff) as u16;
        self.display_horiz_end   = ((val >> 12) & 0xfff) as u16;
    }

    /// GP1(0x07): Display Vertical Range
    fn gp1_display_vertical_range(&mut self, val: u32) {
        self.display_line_start = (val & 0x3ff) as u16;
        self.display_line_end   = ((val >> 10) & 0x3ff) as u16;
    }
}

Note that those commands use a different packing format for their parameters.

GP0 Monochrome Quadrilateral command

We're finally getting to the interesting part: the first draw command. The BIOS sends a linked list to the GPU containing command 0x28000000. GP0 opcode 0x28 draws a monochrome quadrilateral. The low 3bytes of the command contain the 24bit BGR color of the polygon (black in this case).

There's a problem however: this command takes 4 additional words as argument containing the coordinates of the 4 vertex needed to draw a quadrilateral. So far we've only implemented single-word GP0 commands so we'll have to improve our code a little.

To simplify our task I'm going to start with a simple container that will accumulate the words for the current command:

/// Buffer holding multi-word fixed-length GP0 command parameters
struct CommandBuffer {
    /// Command buffer: the longuest possible command is GP0(0x3E)
    /// which takes 12 parameters
    buffer: [u32; 12],
    /// Number of words queued in buffer
    len:    u8,
}

impl CommandBuffer {
    fn new() -> CommandBuffer {
        CommandBuffer {
            buffer: [0; 12],
            len:    0,
        }
    }

    /// Clear the command buffer
    fn clear(&mut self) {
        self.len = 0;
    }

    fn push_word(&mut self, word: u32) {
        self.buffer[self.len as usize] = word;

        self.len += 1;
    }
}

impl ::std::ops::Index<usize> for CommandBuffer {
    type Output = u32;

    fn index<'a>(&'a self, index: usize) -> &'a u32 {
        if index >= self.len as usize {
            panic!("Command buffer index out of range: {} ({})",
                   index, self.len);
        }

        &self.buffer[index]
    }
}

It's just a glorified array which can contain up to 12 words and keeps the count of how many words have been pushed into it. The std::ops::Index mumbo jumbo just overloads the [] operator to let us access CommandBuffer elements like a regular array.

We can add an instance of this CommandBuffer to our GPU state and we'll also add a counter of the number of remaining parameters and a function pointer to the method which implements the command (it will save us having to match the opcode twice):

pub struct Gpu {
    //...

    /// Buffer containing the current GP0 command
    gp0_command: CommandBuffer,
    /// Remaining words for the current GP0 command
    gp0_command_remaining: u32,
    /// Pointer to the method implementing the current GP) command
    gp0_command_method: fn(&mut Gpu),
}

We can now modify our GP0 register handler to use this new infrastructure:

impl Gpu {
    //...

    /// Handle writes to the GP0 command register
    pub fn gp0(&mut self, val: u32) {
        if self.gp0_command_remaining == 0 {
            // We start a new GP0 command
            let opcode = (val >> 24) & 0xff;

            let (len, method) =
                match opcode {
                    0x00 =>
                        (1, Gpu::gp0_nop
                            as fn(&mut Gpu)),
                    0x28 =>
                        (5, Gpu::gp0_quad_mono_opaque
                            as fn(&mut Gpu)),
                    0xe1 =>
                        (1, Gpu::gp0_draw_mode
                            as fn(&mut Gpu)),
                    0xe2 =>
                        (1, Gpu::gp0_texture_window
                            as fn(&mut Gpu)),
                    0xe3 =>
                        (1, Gpu::gp0_drawing_area_top_left
                            as fn(&mut Gpu)),
                    0xe4 =>
                        (1, Gpu::gp0_drawing_area_bottom_right
                            as fn(&mut Gpu)),
                    0xe5 =>
                        (1, Gpu::gp0_drawing_offset
                            as fn(&mut Gpu)),
                    0xe6 =>
                        (1, Gpu::gp0_mask_bit_setting
                            as fn(&mut Gpu)),
                    _ => panic!("Unhandled GP0 command {:08x}",
                                val),
                };

            self.gp0_command_remaining = len;
            self.gp0_command_method = method;

            self.gp0_command.clear();
        }

        self.gp0_command.push_word(val);
        self.gp0_command_remaining -= 1;

        if self.gp0_command_remaining == 0 {
            // We have all the parameters, we can run the command
            (self.gp0_command_method)(self);
        }
    }

    /// GP0(0x00): No Operation
    fn gp0_nop(&mut self) {
        // NOP
    }
}

We're still missing the implementation of the gp0_quad_mono_opaque function that's supposed to render the primitive in the framebuffer. We could start drawing to the screen right away but since we only have a black rectangle so far it wouldn't be very interesting. Let's put a placeholder for now and continue a little further before we fire up OpenGL:

impl Gpu {
    //...

    /// GP0(0x28): Monochrome Opaque Quadrilateral
    fn gp0_quad_mono_opaque(&mut self) {
        println!("Draw quad");
    }
}

Interleaved video deadlock workaround

Unfortunately when we attempt to continue the execution to get to the next GPU commands we enter an infinite loop in the BIOS. That's weird, especially since we got way past that point before we started implementing the GPU.

If we disassemble the code at the deadlock location we discover that the BIOS is apparently waiting for bit 31 of the GPUSTAT register to change. This bit is supposed to alternate between odd and even lines when the output is interlaced). But we haven't implemented this bit yet.

So why did it work before? After some testing it turns out it's because this particular piece of code is only entered when GPUSTAT retuns bit 19 set, i.e. when the vertical resolution is set to 480 lines. Paradoxically by implementing the GPUSTAT register and improving the accuracy of our emulator we caused a regression.

I'm not entirely sure what this particular piece of BIOS code does to be honest, to figure it out I'd have to disassemble a bigger chunk of surrounding code to figure out what it's trying to do and I don't want to go down this rabbit hole at that point. I'd rather implement bit 31 of GPUSTAT correctly but in order to do that we need accurate GPU timings and we don't have that yet.

In that light and in order to keep us moving we're going to use a temporary hack: in the GPUSTAT register we're going to return 0 in bit 19 no matter what. It's not accurate but it will side step that problemetic piece of code. When we implement GPU timings and emulate bit 31 correctly we'll revert that change:

impl Gpu {
    //...

    /// Retrieve value of the status register
    pub fn status(&self) -> u32 {
        let mut r = 0u32;

        //...
        r |= self.hres.into_status();
        // XXX Temporary hack: if we don't emulate bit 31 correctly
        // setting `vres` to 1 locks the BIOS:
        // r |= (self.vres as u32) << 19;
        r |= (self.vmode as u32) << 20;
    //...
    }
}

This is not very satisfactory of course but that should allow us to keep going with our first GPU implementation. Soon after that we'll start working on accurate timings and we'll be able to emulate bit 31 properly.

GP0 Clear Cache command

We can now resume the BIOS execution and we reach a new GP0 command: 0x01000000. This command is used to clear the internal texture cache. Since we don't implement a texture cache yet we can just ignore this command for now:

impl Gpu {
    //...

    /// GP0(0x01): Clear Cache
    fn gp0_clear_cache(&mut self) {
        // Not implemented
    }
}

GP0 Load Image command

Next we have GP0 command 0xa0000000 which is used to load an image into the GPU's VRAM using the CPU or the DMA. This is how a program can load a texture or a palette into the GPU's dedicated memory.

The command takes two additional word parameters: the first one contains the coordinates of the top-left corner of target rectangle in the VRAM. The 2nd one contains the resolution of the image (width/heigh) in pixels. The GPU then expects the pixel data on the same GP0 port.

Since the GPU uses 16bits pixels and the CPU/DMA send 32bits at a time to the GP0 port an additional 16bits of padding must be added in the total number of pixels is odd.

Since this command is immediately followed by the image data the total amount of data transfered can be quite big, storing all of it in our gp0_command buffer to copy it all in the VRAM afterwards would be wasteful. Instead we are going to special case image transfer to store the data directly inside the VRAM.

I add a new variable in the GPU state holding the current mode of the GP0 port:

pub struct Gpu {
    //...

    /// Current mode of the GP0 register
    gp0_mode: Gp0Mode,
}

impl Gpu {
    pub fn new() -> Gpu {
        Gpu {
            //...

            gp0_mode: Gp0Mode::Command,
        }
    }

    //...
}

/// Possible states for the GP0 command register
enum Gp0Mode {
    /// Default mode: handling commands
    Command,
    /// Loading an image into VRAM
    ImageLoad,
}

I also renamed gp0_command_remaining into gp0_words_remaining since it will also count the remaining number of image words to load.

We can then tweak the gp0 method to handle this new mode:

impl Gpu {
    //...

    /// Handle writes to the GP0 command register
    pub fn gp0(&mut self, val: u32) {
        if self.gp0_words_remaining == 0 {
            // We start a new GP0 command
            let opcode = (val >> 24) & 0xff;

            let (len, method) =
                match opcode {
                    //...
                    0xa0 =>
                        (3, Gpu::gp0_image_load as fn(&mut Gpu)),
                    //...
                };

            self.gp0_words_remaining = len;
            self.gp0_command_method = method;

            self.gp0_command.clear();
        }

        self.gp0_words_remaining -= 1;

        match self.gp0_mode {
            Gp0Mode::Command => {
                self.gp0_command.push_word(val);

                if self.gp0_words_remaining == 0 {
                    // We have all the parameters, we can run
                    // the command
                    (self.gp0_command_method)(self);
                }
            }
            Gp0Mode::ImageLoad => {
                // XXX Should copy pixel data to VRAM

                if self.gp0_words_remaining == 0 {
                    // Load done, switch back to command mode
                    self.gp0_mode = Gp0Mode::Command;
                }
            }
        }
    }
}

I added the gp0_image_load command which I consider to be 3 words long. The method uses those parameters to compute the number of words we must expect as part of the image data and puts it back in gp0_words_remaining while switching gp0_mode to ImageLoad:

impl Gpu {
    //...

    /// GP0(0XA0): Image Load
    fn gp0_image_load(&mut self) {
        // Parameter 2 contains the image resolution
        let res = self.gp0_command[2];

        let width  = res & 0xffff;
        let height = res >> 16;

        // Size of the image in 16bit pixels
        let imgsize = width * height;

        // If we have an odd number of pixels we must round up
        // since we transfer 32bits at a time. There'll be 16bits
        // of padding in the last word.
        let imgsize = (imgsize + 1) & !1;

        // Store number of words expected for this image
        self.gp0_words_remaining = imgsize / 2;

        // Put the GP0 state machine in ImageLoad mode
        self.gp0_mode = Gp0Mode::ImageLoad;
    }
}

Of course in my gp0 implementation above I don't actually do anything with the image data. When we add support for the VRAM emulation we should copy the image data at the right location (given by the first command parameter) but there's no reason to bother with that at that point.

DMA image transfer

The BIOS doesn't send the image data in a linked list like other GP0 commands, instead it uses a regular "block" DMA transfer so we need to plug our gp0 method in it:

impl Interconnect {
    //...

    /// Emulate DMA transfer for Manual and Request synchronization
    /// modes.
    fn do_dma_block(&mut self, port: Port) {
        //...

        while remsz > 0 {
            //...

            match channel.direction() {
                Direction::FromRam => {
                    let src_word = self.ram.load32(cur_addr);

                    match port {
                        Port::Gpu => self.gpu.gp0(src_word),
                        _ =>
                        panic!("Unhandled DMA destination port {}",
                               port as u8),
                    }
                }
                //...
            }

            //...
        }

        //...
    }
}

Our emulator now loads textures to the GPU and then discards them immediately. Beautiful.

GP1 Display Enable command

After that the bios issues GP1 command 0x03000000 which is used to set the value of of our display_disabled field:

impl Gpu {
    //...

    /// GP1(0x03): Display Enable
    fn gp1_display_enable(&mut self, val: u32) {
        self.display_disabled = val & 1 != 0;
    }
}

In this case it sets it to 0, effectively enabling the display.

GP0 Image Store command

Then the BIOS does something quite perplexing: it issues a 0xc0000000 command on GP0 which is used to copy data from the VRAM to the CPU/DMA. The parameters are the same as the "Load Image" command but this time it's the GPU providing the pixel data through the GPUREAD register. The CPU (or DMA) can then read the data 32bits at a time by reading this register until all the image has been transfered.

I find it perplexing because I can't imagine what the BIOS is trying to do here. Since it hasn't rendered anything worthwhile yet it doesn't have anything interesting to recover. Maybe it's part of a self-test of some sort.

We could try to figure it out by disassembling the code that calls this command but I don't want to bother with that. I'm just going to foolishly ignore it for now. Whatever this code is doing doesn't seem to prevent the BIOS from continuing its execution normally (it gets through the boot animation and starts probing the CDROM). So let's use a placeholder implementation for now:

impl Gpu {
    //...

    /// GP0(0xC0): Image Store
    fn gp0_image_store(&mut self) {
        // Parameter 2 contains the image resolution
        let res = self.gp0_command[2];

        let width  = res & 0xffff;
        let height = res >> 16;

        println!("Unhandled image store: {}x{}", width, height);
    }
}

We don't have to do anything more: after this command the BIOS will expect the image data to be available through the GPUREAD register. Right now our implementation of this register always returns 0 so it will read that as many times as it wants.

GP0 Shaded Quadrilateral command

At long last we're reaching the interesting part! The BIOS is starting to draw the boot animation with the "Sony Computer Entertainment" logo. We're only missing a few commands before we can proceed to implement the OpenGL renderer itself.

The first one is 0x380000b2 which draws a shaded quadrilateral. It means that unlike the previous quad command this one takes one color per vertex and fills the shape with a Gouraud shading which creates a gradient between those values. We'll see that this type of shading is trivial to implement in OpenGL.

This command takes 8 parameters: 4 vertex position and their assorted colors. As for the other drawing commands let's put a placeholder for the moment:

impl Gpu {
    //...

    /// Handle writes to the GP0 command register
    pub fn gp0(&mut self, val: u32) {
        if self.gp0_words_remaining == 0 {
            //...

            let (len, method) =
                match opcode {
                    //...
                    0x38 =>
                        (8, Gpu::gp0_quad_shaded_opaque
                            as fn(&mut Gpu)),
                    //...
                };
            //...
        }
        //...
    }

    /// GP0(0x38): Shaded Opaque Quadrilateral
    fn gp0_quad_shaded_opaque(&mut self) {
        println!("Draw quad shaded");
    }
}

GP0 Shaded Triangle command

After that we get to the GP0 command 0x300000b2. This command is almost identical to the one before except that it draws a triangle instead of a quad. As such it only takes 6 parameters (3 position/color couples):

impl Gpu {
    //...

    /// Handle writes to the GP0 command register
    pub fn gp0(&mut self, val: u32) {
        if self.gp0_words_remaining == 0 {
            //...

            let (len, method) =
                match opcode {
                    //...
                    0x30 =>
                        (6, Gpu::gp0_triangle_shaded_opaque
                            as fn(&mut Gpu)),
                    //...
                };
            //...
        }
        //...
    }

    /// GP0(0x30): Shaded Opaque Triangle
    fn gp0_triangle_shaded_opaque(&mut self) {
        println!("Draw triangle shaded");
    }
}

GP0 Textured Quadrilateral With Color Blending command

Our last drawing command for the moment will be 0x2c808080 which is the fanciest yet: it draws a quadrilateral, maps a texture on it while blending it with a solid color. It takes 9 parameters:

impl Gpu {
    //...

    /// Handle writes to the GP0 command register
    pub fn gp0(&mut self, val: u32) {
        if self.gp0_words_remaining == 0 {
            //...

            let (len, method) =
                match opcode {
                    //...
                    0x2c =>
                        (9, Gpu::gp0_quad_texture_blend_opaque
                            as fn(&mut Gpu)),
                    //...
                };
            //...
        }
        //...
    }

    /// GP0(0x2C): Textured Opaque Quadrilateral
    fn gp0_quad_texture_blend_opaque(&mut self) {
        println!("Draw quad texture blending");
    }
}

GP1 Acknowledge Interrupt command

Once the BIOS has finished displaying the boot logo it attempts to acknowledge the GPU interrupt by issuing the GP1 command 0x02000000. Of course in our current implementation we never trigger the interrupt in the first place but we might as well add a simple implementation anyway:

impl Gpu {
    //...

    /// GP1(0x02): Acknowledge Interrupt
    fn gp1_acknowledge_irq(&mut self) {
        self.interrupt = false;
    }
}

GP1 Reset Command Buffer command

And we finish this sequence with GP1 command 0x01000000 which clears the command FIFO. We don't implement the FIFO itself yet but we can at least reset the GP0 state machine to a default state:

impl Gpu {
    //...

    /// GP1(0x01): Reset Command Buffer
    fn gp1_reset_command_buffer(&mut self) {
        self.gp0_command.clear();
        self.gp0_words_remaining = 0;
        self.gp0_mode = Gp0Mode::Command;
        // XXX should also clear the command FIFO when we
        // implement it
    }
}

I take the opportunity add a call to this function in gp1_reset since it should also clear the command buffer.

And that's it! We have our entire GPU command sequence to display the boot logo. Now we can implement a basic OpenGL renderer to visualize it all.

The GPU: Basic OpenGL renderer for the boot logo

For our first renderer we're not going to bother with the Video Display: since the GPU's internal video memory has a total resolution of 1024x512 we'll just display all of the framebuffer at once and draw the primitives directly on the screen. We just need to take the draw commands, render them in our framebuffer in the native internal resolution and display it all. Easy.

The first step is to create a window and retreive an OpenGL context to draw in it. OpenGL itself doesn't handle things like window creations since that's system specific. There are many libraries out there to take care of that (GLFW, freeglut, etc...). For my emulator I opted for the SDL2 library, mainly because I'm familiar with it. I'll also use this library to handle controller input and, later on, sound support.

If for some reason you prefer to use an other library (or libraries) to handle these system-spefic interfaces rest assured that it won't change anything to the OpenGL code itself, just the window setup code.

Window and OpenGL context creation

Here's the code for creating a window and recovering its OpenGL context with the SDL2:

use sdl2;
use sdl2::video::{OPENGL, WindowPos};
use sdl2::video::GLAttr::GLContextMajorVersion;
use sdl2::video::GLContextMinorVersion;
use gl;
use libc::c_void;

pub struct Renderer {
    sdl_context: sdl2::sdl::Sdl,
    window: sdl2::video::Window,
    gl_context: sdl2::video::GLContext,
}

impl Renderer {

    pub fn new() -> Renderer {
        let sdl_context = sdl2::init(::sdl2::INIT_VIDEO).unwrap();

        sdl2::video::gl_set_attribute(GLContextMajorVersion, 3);
        sdl2::video::gl_set_attribute(GLContextMinorVersion, 3);

        let window = sdl2::video::Window::new(
                                     &sdl_context,
                                     "PSX",
                                     WindowPos::PosCentered,
                                     WindowPos::PosCentered,
                                     1024, 512,
                                     PENGL).unwrap();

        let gl_context = window.gl_create_context().unwrap();

        gl::load_with(|s|
                      sdl2::video::gl_get_proc_address(s).unwrap()
                      as *const c_void);

        Renderer {
            sdl_context: sdl_context,
            window: window,
            gl_context: gl_context,
        }
    }
}

The function sdl2::init calls the global SDL2 initialization routine. For now we're only using the VIDEO subsystem. In the SDL2 C API this function doesn't return anything but the rust bindings return an object that's used to call SDL_Quit automatically when it's destroyed. In C you have to call SDL_Quit explicitly when your program exits (or whenever you don't need the SDL anymore).

After that the two gl_set_attribute calls say that we're going to use OpenGL 3.334.

Then Window::new creates the window itself with a resolution of 1024x512 (the resolution of the VRAM) and OpenGL support. I named the window "PSX" because I don't have any imagination.

We can retreive the window's OpenGL context with the gl_create_context method and then we must load the OpenGL function pointers. You don't really need to understand that part in details, it's some glue between the OpenGL and SDL libraries, you just need to make sure it's done before we start calling OpenGL commands.

Finally we store the SDL context, window and OpenGL context in the newly created Renderer object. We need to put an instance of this struct in our GPU:

pub struct Gpu {
    //...

    /// OpenGL renderer
    renderer: opengl::Renderer,
}

impl Gpu {
    pub fn new() -> Gpu {
        Gpu {
            //...
            renderer: opengl::Renderer::new(),
        }
    }
}

If everything works well our emulator should now create a window when starting up. The window's contents are garbage however (on my system it contains a chunk of the screen). We can clear it by issuing the following calls:

impl Renderer {

    pub fn new() -> Renderer {
        //...

        // Clear the window
        unsafe {
            gl::ClearColor(0., 0., 0., 1.0);
            gl::Clear(gl::COLOR_BUFFER_BIT);
        }

        window.gl_swap_window();

        Renderer {
            //...
        }
    }
}

The unsafe keyword is there because as far as Rust is concerned all OpenGL calls are a C foreign function interface and are therefore potentially memory unsafe. The ClearColor35function sets the clear color (duh): the first three parameters are the red, green and blue components and the fourth is the alpha parameter. They all are floating point integers in the range [0.0, 1.0]. In this case all the color components are 0.0 so the color is black and alpha is set to 1.0 which means it's fully opaque.

The Clear function then applies this color to the entire color buffer. You'll notice that we just give the type of buffer we want to clear as parameter, not a handle to a specific buffer. That's the way most of the OpenGL API works: you "bind" various types of object to an implicit global context and the subsequent function calls act on the currently bound object of for a given type. In this case we haven't bound anything ourselves, by default the color buffer will be the window's framebuffer.

The gl_swap_window forces a window update and displays the result of the previous commands. With this addition the window should now appear completely black. Progress!

Drawing the primitives

Now let's do something more interesting: drawing the primitives. This is the part where we'll have to write a whole lot of OpenGL glue so take a deep breath and dive in.

Let's choose a primitive to start with, I've decided to use GP0(0x30),the gouraud shaded triangle. It's a simple shape with some basic shading. It has three vertex, each having a position in VRAM and a color. Let's create structs to hold those attributes in a shader-friendly fashion:

/// Position in VRAM.
#[derive(Copy,Clone,Default,Debug)]
pub struct Position(pub GLshort, pub GLshort);

impl Position {
    /// Parse position from a GP0 parameter
    pub fn from_gp0(val: u32) -> Position {
        let x = val as i16;
        let y = (val >> 16) as i16;

        Position(x as GLshort, y as GLshort)
    }
}

/// RGB color
#[derive(Copy,Clone,Default,Debug)]
pub struct Color(pub GLubyte, pub GLubyte, pub GLubyte);

impl Color {
    /// Parse color from a GP0 parameter
    pub fn from_gp0(val: u32) -> Color {
        let r = val as u8;
        let g = (val >> 8) as u8;
        let b = (val >> 16) as u8;

        Color(r as GLubyte, g as GLubyte, b as GLubyte)
    }
}

I store the Position as a pair of of GLshorts, OpenGL's signed 16bit integer type. The color is stored as a triplet of unsigned bytes, GLubyte. Internally OpenGL uses floats for screen coordinates and colors but we'll be able to make the conversion in the shaders.

We can use these new types to create two arrays: one will contain the three vertex positions of the triangle in VRAM, the other the associated colors:

impl Gpu {
    //...

    /// GP0(0x30): Shaded Opaque Triangle
    fn gp0_triangle_shaded_opaque(&mut self) {
        let positions = [
            Position::from_gp0(self.gp0_command[1]),
            Position::from_gp0(self.gp0_command[3]),
            Position::from_gp0(self.gp0_command[5]),
            ];

        let colors = [
            Color::from_gp0(self.gp0_command[0]),
            Color::from_gp0(self.gp0_command[2]),
            Color::from_gp0(self.gp0_command[4]),
            ];

        self.renderer.push_triangle(positions, colors);
    }
}

Now we need to implement this push_triangle method that will put the attributes in a list of vertex to render. That's where the fun begins.

First we need to setup somme buffers to hold the data. There are several ways to send data to the GPU, I've decided to go with persistently mapped buffers. The idea is that we're going to ask OpenGL to allocate some memory that will be shared between the GPU and us. We'll fill it with our data and when we're ready we'll tell the GPU to use it to draw the scene. Easy.

To avoid duplicating a bunch of code let's make a generic Buffer struct holding an attribute buffer and its mapping:

// Write only buffer with enough size for VERTEX_BUFFER_LEN elements
pub struct Buffer<T> {
    /// OpenGL buffer object
    object: GLuint,
    /// Mapped buffer memory
    map: *mut T,
}

impl<T: Copy + Default> Buffer<T> {
    /// Create a new buffer bound to the current vertex array
    /// object.
    pub fn new() -> Buffer<T> {
        let mut object = 0;
        let mut memory;

        unsafe {
            // Generate the buffer object
            gl::GenBuffers(1, &mut object);

            // Bind it
            gl::BindBuffer(gl::ARRAY_BUFFER, object);

            // Compute the size of the buffer
            let element_size = size_of::<T>() as GLsizeiptr;
            let buffer_size = element_size * VERTEX_BUFFER_LEN as GLsizeiptr;

            // Write only persistent mapping. Not coherent!
            let access = gl::MAP_WRITE_BIT | gl::MAP_PERSISTENT_BIT;

            // Allocate buffer memory
            gl::BufferStorage(gl::ARRAY_BUFFER,
                              buffer_size,
                              ptr::null(),
                              access);

            // Remap the entire buffer
            memory = gl::MapBufferRange(gl::ARRAY_BUFFER,
                                        0,
                                        buffer_size,
                                        access) as *mut T;

            // Reset the buffer to 0 to avoid hard-to-reproduce bugs
            // if we do something wrong with unitialized memory
            let s = slice::from_raw_parts_mut(memory,
                                              VERTEX_BUFFER_LEN as usize);

            for x in s.iter_mut() {
                *x = Default::default();
            }
        }

        Buffer {
            object: object,
            map: memory,
        }
    }

    /// Set entry at `index` to `val` in the buffer.
    pub fn set(&mut self, index: u32, val: T) {
        if index >= VERTEX_BUFFER_LEN {
            panic!("buffer overflow!");
        }

        unsafe {
            let p = self.map.offset(index as isize);

            *p = val;
        }
    }
}

impl<T> Drop for Buffer<T> {
    fn drop(&mut self) {
        unsafe {
            gl::BindBuffer(gl::ARRAY_BUFFER, self.object);
            gl::UnmapBuffer(gl::ARRAY_BUFFER);
            gl::DeleteBuffers(1, &self.object);
        }
    }
}

/// Maximum number of vertex that can be stored in an attribute
/// buffers
const VERTEX_BUFFER_LEN: u32 = 64 * 1024;

That's a lot of code to simply allocate a buffer! Let's walk through it:

  • First GenBuffers creates a new buffer object. That doesn't allocate the buffer memory, it basically just creates a handle.

  • This handle is then bound with BindBuffer, from then on the commands targetting ARRAY_BUFFER will use this buffer.

  • We must then compute the size of the buffer in bytes. I've decided to hardcode the length of the buffer in VERTEX_BUFFER_LEN, ideally it should be big enough to hold an entire scene (otherwise we'll have to make several draw calls per frame), but not too big in order not to waste memory. We'll probably want to better tune that constant later.

  • Once we know how much room we need we can ask OpenGL to allocate it for us. We request MAP_WRITE_BIT since we want to write-only access to the buffer and MAP_PERSISTENT_BIT to be able to hold the mapping persistently (instead of having to remap it for each frame).

  • Now we can retreive a pointer to this memory location using MapBufferRange to remap the buffer in the process' address space.

  • To make debugging easier if we mess something up I then reset the buffer's memory to zero. This way if we attempt to draw an unused part of the buffer by mistake we'll still have a well defined behaviour instead of drawing random unitialized data.

  • The set method will be used to store an entry in the buffer.

  • The Drop destructor will cleanup everything when we're done.

We can add our two buffers to the Renderer right now but creating buffers without having any shaders to render them isn't very useful.

The vertex shader

If you're not familiar with the concept of shaders you should take the time to read about them a little before we continue. Basically they're programs executed by various GPU stages. We'll only need two shaders for now: the vertex shader and the fragment shader.

The vertex shader is the first programmable stage in the OpenGL pipeline. It will receive the vertex coordinates and the colors from our attribute buffers. It'll have to convert them from the Playstation VRAM representation to the one used by OpenGL and pass them on to the next stage:

#version 330 core

in ivec2 vertex_position;
in uvec3 vertex_color;

out vec3 color;

void main() {
  // Convert VRAM coordinates (0;1023, 0;511) into
  // OpenGL coordinates (-1;1, -1;1)
  float xpos = (float(vertex_position.x) / 512) - 1.0;
  // VRAM puts 0 at the top, OpenGL at the bottom,
  // we must mirror vertically
  float ypos = 1.0 - (float(vertex_position.y) / 256);

  gl_Position.xyzw = vec4(xpos, ypos, 0.0, 1.0);

  // Convert the components from [0;255] to [0;1]
  color = vec3(float(vertex_color.r) / 255,
               float(vertex_color.g) / 255,
               float(vertex_color.b) / 255);
}

OpenGL shader language, also called GLSL, looks a bit like C but don't let that fool you, it's actually quite different. For one you can see that the parameters and return values are not given in the main prototype, instead they're given at the global scope as in and out parameters.

We have two in parameters: the vertex position (a pair of signed integers) and its color (a triplet of unsigned integers). The main function is called once for each vertex. Our triangle as three vertices so it'll be called 3 times.

The shader sets two output variables: color (a triplet of three floats) and gl_Position which is a a builtin GLSL variable, a vector of four floats. The last two components of gl_Position are the z (depth) coordinate which is always 0 for us since we're drawing in 2D and the w parameter (the homogeneous component) which should be 1.0 for a position. This last parameter is used for perspective correct projection36.

You can see that the OpenGL horizontal and vertical screen coordinates go from -1.0 to 1.0 (no matter the actual resolution of the screen) and that the vertical coordinates go in the opposite direction than the Playstation VRAM addressing. OpenGL colors are also floats in the range [0.0, 1.0].

You can see that our vertex shader does all the work of converting coordinates and colors from the Playstation internal representation to the OpenGL format. In general we'll want to offload as much computation as possible to the GPU since I'm expecting the emulation bottleneck to be on the CPU.

After all Playstation graphics are extremely simple compared to modern games, for instance modern GPUs have gigabytes worth of video RAM compared to the Playstation's puny 2MB. Even if we enhance the graphics significantly our graphic cards shouldn't break a sweat if we're careful not to write extremely poorly optimized shader code.

The fragment shader

Once the primitive passed through the vertex shader it will be rasterized37. In the rasterization process the triangle primitive is converted into individual fragments. In our case the fragments will be the individual screen pixels but with multisampling enabled you can get several fragments per pixels that get averaged to produce the final pixel value.

For each fragment in the rasterized primitive the GPU then runs the fragment shader whose job is to produce the fragment's color:

#version 330 core

in vec3 color;
out vec4 frag_color;

void main() {
    frag_color = vec4(color, 1.0);
}

Pretty straightforward: the output color frag_color (the name is arbitrary) takes the value of the input attribute color and a fourth value which is the alpha channel to handle transparent pixels. In our case the pixels are fully opaque so it's hardcoded to 1.0.

If you're not familiar with OpenGL you're probably puzzled, what's the value of this color parameter exactly? A triangle has three vertices, potentially each with a different color, so which one do we get here?

What happens is that in this case OpenGL tells the GPU to interpolate the value of the color based on its distance to the three vertices and their respective color. That means that we'll get a smooth gradient which is exactly what we need for the gouraud shading. OpenGL does all the hard work for us!

OpenGL shaded RGB triangle

[[fig:opengl_shaded_triangle]]{#fig:opengl_shaded_triangle label="fig:opengl_shaded_triangle"}

Figure [fig:opengl_shaded_triangle]{reference-type="ref" reference="fig:opengl_shaded_triangle"} shows an example of a triangle rendered with our fragment shader: each of the three vertex is colored using one of the RGB colors and we can see that the GPU interpolates the gradient for each of the pixels inside the triangle.

Compiling and linking the shaders

We can now piece our shaders together in our Renderer:

pub struct Renderer {
    //...

    /// Vertex shader object
    vertex_shader: GLuint,
    /// Fragment shader object
    fragment_shader: GLuint,
    /// OpenGL Program object
    program: GLuint,
    /// OpenGL Vertex array object
    vertex_array_object: GLuint,
    /// Buffer containing the vertice positions
    positions: Buffer<Position>,
    /// Buffer containing the vertice colors
    colors: Buffer<Color>,
    /// Current number or vertices in the buffers
    nvertices: u32,
}


impl Renderer {

    pub fn new() -> Renderer {
        //...

        // "Slurp" the contents of the shader files. Note: this is
        // a compile-time thing.
        let vs_src = include_str!("vertex.glsl");
        let fs_src = include_str!("fragment.glsl");

        // Compile our shaders...
        let vertex_shader   = compile_shader(vs_src,
                                             gl::VERTEX_SHADER);
        let fragment_shader = compile_shader(fs_src,
                                             gl::FRAGMENT_SHADER);
        // ... Link our program...
        let program = link_program(&[vertex_shader,
                                     fragment_shader]);
        // ... And use it.
        unsafe {
            gl::UseProgram(program);
        }

        // Generate our vertex attribute object that will hold our
        // vertex attributes
        let mut vao = 0;
        unsafe {
            gl::GenVertexArrays(1, &mut vao);
            // Bind our VAO
            gl::BindVertexArray(vao);
        }

        // Setup the "position" attribute. First we create
        // the buffer holding the positions (this call also
        // binds it)
        let positions = Buffer::new();

        unsafe {
            // Then we retreive the index for the attribute in the
            // shader
            let index = find_program_attrib(program,
                                            "vertex_position");

            // Enable it
            gl::EnableVertexAttribArray(index);

            // Link the buffer and the index: 2 GLshort attributes,
            // not normalized. That should send the data untouched
            // to the vertex shader.
            gl::VertexAttribIPointer(index, 2, gl::SHORT, 0, ptr::null());
        }

        // Setup the "color" attribute and bind it
        let colors = Buffer::new();

        unsafe {
            let index = find_program_attrib(program,
                                            "vertex_color");
            gl::EnableVertexAttribArray(index);

            // Link the buffer and the index: 3 GLByte attributes,
            // not normalized. That should send the data untouched
            // to the vertex shader.
            gl::VertexAttribIPointer(index,
                                     3,
                                     gl::UNSIGNED_BYTE,
                                     0,
                                     ptr::null());
        }

        Renderer {
            sdl_context: sdl_context,
            window: window,
            gl_context: gl_context,
            vertex_shader: vertex_shader,
            fragment_shader: fragment_shader,
            program: program,
            vertex_array_object: vao,
            positions: positions,
            colors: colors,
            nvertices: 0,
        }
    }
    //...
}

Quite a lot of code to go through here. I put the code for our two shaders described earlier in two files named "vertex.glsl" and "fragment.glsl" respectively. I retreive their contents here using Rust's include_str directive. Then I ask OpenGL to compile both shaders using the compile_shader helper function:

pub fn compile_shader(src: &str, shader_type: GLenum) -> GLuint {
    let shader;

    unsafe {
        shader = gl::CreateShader(shader_type);
        // Attempt to compile the shader
        let c_str = CString::new(src.as_bytes()).unwrap();
        gl::ShaderSource(shader, 1, &c_str.as_ptr(), ptr::null());
        gl::CompileShader(shader);

        // Extra bit of error checking in case we're not using a
        // DEBUG OpenGL context and check_for_errors can't do it
        // properly:
        let mut status = gl::FALSE as GLint;
        gl::GetShaderiv(shader, gl::COMPILE_STATUS, &mut status);

        if status != (gl::TRUE as GLint) {
            panic!("Shader compilation failed!");
        }
    }

    shader
}

The compilation is always done at runtime when we start the emulator.

Once the shaders are compiled we must link them together to form a complete OpenGL "program". This is done by the link_program helper function:

pub fn link_program(shaders: &[GLuint]) -> GLuint {
    let program;

    unsafe {
        program = gl::CreateProgram();

        for &shader in shaders {
            gl::AttachShader(program, shader);
        }

        gl::LinkProgram(program);

        // Extra bit of error checking in case we're not using a
        // DEBUG OpenGL context and check_for_errors can't do it
        // properly:
        let mut status = gl::FALSE as GLint;
        gl::GetProgramiv(program, gl::LINK_STATUS, &mut status);

        if status != (gl::TRUE as GLint) {
            panic!("OpenGL program linking failed!");
        }
    }

    program
}

Once the program is linked UseProgram activates it. We can then setup our position and color attributes.

Vertex array objects

First we need to create a "vertex array object" (VAOs) to old the attributes. The idea is that if you have different sets of attributes in your application and you want to be able to switch rapidly you create one vertex array object per set and you can then switch between them with a single call (instead of one call per attribute).

We don't really need more than one set at that point so we just create a single one with GenVertexArrays and bind it with BindVertexArray.

At last we use our Buffer struct to initialize the positions buffer. We then need to associate it with the vertex_position attribute in the vertex shader. In order to do this we use the find_program_attrib function to recover the attribute index is the 'program':

/// Return the index of attribute `attr` in `program`. Panics if
/// the index isn't found.
pub fn find_program_attrib(program: GLuint, attr: &str) -> GLuint {
    let cstr = CString::new(attr).unwrap().as_ptr();

    let index = unsafe { gl::GetAttribLocation(program, cstr) };

    if index < 0 {
        panic!("Attribute \"{}\" not found in program", attr);
    }

    index as GLuint
}

We must then enable the attribute with EnableVertexAttribArray and we describe the format of the buffer with VertexAttribIPointer. This last call is very important to get right, otherwise the program's behavior will be potentially undefined:

  • The first parameter is the index of the attribute in the program.

  • The second parameter contains the number of elements per vertex in the buffer. For the position we have the x and y coordinates, so that's two. It matches our declaration in the vertex shader since we used an ivec2 to hold this value.

  • The third parameter is the type of each element. It must match the type we're using to represent the values in our rust code. In this case we're using GLshorts to hold the coordinates so we set it to SHORT.

  • The fourth parameter is the "stride" which is a number of bytes the GPU will skip between each value. Since we don't have any padding in our buffer we set it to 0.

  • The last parameter is an optional pointer to some data that will be copied as the initial value of the attribute buffer. We don't have any data to put in at that point (and we could do it through our Buffer mapping if we wanted anyway) so we set it to NULL.

After this call our position buffer will be ready for use!

We then go through the same sequence for our color buffer, the only difference being the parameters to the VertexAttribIPointer call: this time we have three values per vertex and the type is UNSIGNED_BYTE.

Finally I put it all in the Renderer struct along with an nvertices variable that will hold the current number of vertices ready to be drawn in the vertex buffers.

In order to clean everything up properly when we exit we need a destructor to release the resources:

impl Drop for Renderer {
    fn drop(&mut self) {
        unsafe {
            gl::DeleteVertexArrays(1, &self.vertex_array_object);
            gl::DeleteShader(self.vertex_shader);
            gl::DeleteShader(self.fragment_shader);
            gl::DeleteProgram(self.program);
        }
    }
}

OpenGL rendering and synchronization

Now we have everything to finally implement our push_triangle command. It will just push the three positions and colors into their respective buffers. However we need to be careful not to overflow so if the buffers are full we must force an early draw:

impl Renderer {
    //...

    /// Add a triangle to the draw buffer
    pub fn push_triangle(&mut self,
                         positions: [Position; 3],
                         colors:    [Color; 3]) {

        // Make sure we have enough room left to queue the vertex
        if self.nvertices + 3 > VERTEX_BUFFER_LEN {
            println!("Vertex attribute buffers full, forcing draw");
            self.draw();
        }

        for i in 0..3 {
            // Push
            self.positions.set(self.nvertices, positions[i]);
            self.colors.set(self.nvertices, colors[i]);
            self.nvertices += 1;
        }
    }
}

The draw command itself is not very complicated but we need to be careful to synchronize ourselves properly with the GPU. That means flushing our buffers before we ask the GPU to start drawing and then waiting for the rendering to finish before we touch the buffers again:

impl Renderer {
    //...

    /// Draw the buffered commands and reset the buffers
    pub fn draw(&mut self) {
        unsafe {
           // Make sure all the data from the persistent mappings
           // is flushed to the buffer
           gl::MemoryBarrier(gl::CLIENT_MAPPED_BUFFER_BARRIER_BIT);

           gl::DrawArrays(gl::TRIANGLES,
                          0,
                          self.nvertices as GLsizei);
        }

        // Wait for GPU to complete
        unsafe {
           let sync = gl::FenceSync(gl::SYNC_GPU_COMMANDS_COMPLETE,
                                     0);

           loop {
               let r = gl::ClientWaitSync(
                                 sync,
                                 gl::SYNC_FLUSH_COMMANDS_BIT,
                                 10000000);

               if r == gl::ALREADY_SIGNALED ||
                  r == gl::CONDITION_SATISFIED {
                   // Drawing done
                   break;
               }
           }
        }

        // Reset the buffers
        self.nvertices = 0;
    }
}

The call to MemoryBarrier makes sure the data written to the mapped buffer is visible by the GPU instead of, say, stuck in a CPU cache. We could avoid this call by mapping the buffer with the MAP_COHERENT_BIT access flag set but that might make writing to the buffers slower so it's not necessarily better.

The DrawArrays function is where the magic happens: it tells the GPU to draw nvertices as triangles. Once this command is issued the GPU will start working asynchronously so we must be careful: if we start pushing new data to the buffers before the GPU is done we might overwrite attributes that are still in use which may cause glitches.

To avoid that we simply wait for the GPU to finish by using a fence: FencSync creates a fence waiting for the current commands to complete and ClientWaitSync is used to wait for completion.

Finally we reset nvertices to 0 to start anew.

This method is actually pretty suboptimal: we stall our emulator completely when the GPU is working. We could improve this by using double buffering on for our attributes but let's leave that for later.

This draw command will render everything but it won't display anything until we swap the window's buffer. We can add a display command to do just that:

impl Render {
    //...

    /// Draw the buffered commands and display them
    pub fn display(&mut self) {
        self.draw();

        self.window.gl_swap_window();
    }
}

Now we need to figure out when to call this method. Normally we'd want to call it at each VSYNC, so 60 or 50 times per second depending on the video mode but we don't support GPU timings yet. Instead for the time being we can find a command that the BIOS calls once per frame and put the display call in there. Once such command seems to be "Set Drawing Offset" so let's put our call to display in there:

impl Gpu {
    //...

    /// GP0(0xE5): Set Drawing Offset
    fn gp0_drawing_offset(&mut self) {
        //...

        // XXX Temporary hack: force display when changing offset
        // since we don't have proper timings
        self.renderer.display();
    }
}

First output of our OpenGL renderer{width="\textwidth"}

[[fig:renderer_first_triangles]]{#fig:renderer_first_triangles label="fig:renderer_first_triangles"}

We should now finally be ready to draw ou first triangles. If you restart the emulator you should end up with the image in figure [fig:renderer_first_triangles]{reference-type="ref" reference="fig:renderer_first_triangles"}.

The two triangles start back-to-back and then move and shrink to their final position. Since we don't yet draw the background quad they're all drawn on top of each other which gives this color smearing effect. Note that the image has a weird aspect ratio (2:1) and that the logo is not centered, it's because we're displaying the entire VRAM framebuffer instead of just the 640x480 portion configured in the video output.

OpenGL debugging

You might have noticed that there's not a whole lot of error checking in my OpenGL code above. We could call 'GetError' after every OpenGL function but that's annoying an noisy. Instead I prefer to use the debug extension.

This extension logs errors, warnings, performance notices and other messages to an internal queue. We can then call GetDebugMessageLog to retreive the messages38:

/// Check for OpenGL errors using `gl::GetDebugMessageLog`. If a
/// severe error is encountered this function panics. If the OpenGL
/// context doesn't have the DEBUG attribute this *probably* won't do
/// anything.
pub fn check_for_errors() {
    let mut fatal = false;

    loop {
        let mut buffer = vec![0; 4096];

        let mut severity = 0;
        let mut source = 0;
        let mut message_size= 0;
        let mut mtype = 0;
        let mut id = 0;

        let count =
            unsafe {
                gl::GetDebugMessageLog(1,
                                       buffer.len() as GLsizei,
                                       &mut source,
                                       &mut mtype,
                                       &mut id,
                                       &mut severity,
                                       &mut message_size,
                                       buffer.as_mut_ptr() as *mut GLchar)
            };

        if count == 0 {
            // No messages left
            break;
        }

        buffer.truncate(message_size as usize);

        let message =
            match str::from_utf8(&buffer) {
                Ok(m) => m,
                Err(e) => panic!("Got invalid message: {}", e),
            };

        let source = DebugSource::from_raw(source);
        let severity = DebugSeverity::from_raw(severity);
        let mtype = DebugType::from_raw(mtype);

        println!("OpenGL [{:?}|{:?}|{:?}|0x{:x}] {}",
                 severity, source, mtype, id, message);

        if severity.is_fatal() {
            // Something is very wrong, don't die just yet in order to
            // display any additional error message
            fatal = true;
        }
    }

    if fatal {
        panic!("Fatal OpenGL error");
    }
}

We can then call the check_for_errors method after critical sections: in 'draw' for instance to check for errors in the past frame but also at the end of 'new' to make sure the initialization went well. There's one caveat though: the debug extension only works when we use a debug OpenGL context. We can get one by setting the CONTEXT_DEBUG attribute before we create the window:

sdl2::video::gl_set_attribute(
                    GLAttr::GLContextFlags,
                    sdl2::video::GL_CONTEXT_DEBUG.bits());

A debug context might be slower than a normal one though so we'll probably want to only activate this for troubleshooting (via a command line flag or something like that). For now performances don't matter in the least so we can leave it enabled at all times.

The error messages themselves are vendor specific but hopefully they should be helpful. For instance with my radeon card if I mess up my vertex shader by replacing vec3 by vec4 in the color affectation I get the following message:

OpenGL [High|ShaderCompiler|Error|0x1] 0:19(10):
error: too few components to vec4

Drawing quadrilaterals

Modern OpenGL doesn't support quads, only points, lines and triangles39. Fortunately for us, neither does the Playstation GPU! When a quad draw command is received it's interpreted as two triangles and drawn that way. This is significant for gouraud shaded quadrilaterals since it means that only three vertices are ever used to interpolate the color of any pixel in the quad. For textured quads it shouldn't make any difference.

We can emulate that behavior in a push_quad method:

Impl Renderer {
    //...

    /// Add a quad to the draw buffer
    pub fn push_quad(&mut self,
                     positions: [Position; 4],
                     colors:    [Color; 4]) {

        // Make sure we have enough room left to queue the vertex. We
        // need to push two triangles to draw a quad, so 6 vertex
        if self.nvertices + 6 > VERTEX_BUFFER_LEN {
            // The vertex attribute buffers are full, force an early
            // draw
            self.draw();
        }

        // Push the first triangle
        for i in 0..3 {
            self.positions.set(self.nvertices, positions[i]);
            self.colors.set(self.nvertices, colors[i]);
            self.nvertices += 1;
        }

        // Push the 2nd triangle
        for i in 1..4 {
            self.positions.set(self.nvertices, positions[i]);
            self.colors.set(self.nvertices, colors[i]);
            self.nvertices += 1;
        }
    }
}

We must duplicate the two vertices shared by the two triangles across one of the quad's diagonal so we end up with 6 vertices for a single quad. It's possible to avoid that duplication (for instance by using indexed rendering) but at that point it would be premature optimization.

Now all that's left to do is to is use push_quad to draw the monochrome and shaded quadrilaterals:

impl Gpu {
    //...

    /// GP0(0x28): Monochrome Opaque Quadrilateral
    fn gp0_quad_mono_opaque(&mut self) {
        let positions = [
            Position::from_gp0(self.gp0_command[1]),
            Position::from_gp0(self.gp0_command[2]),
            Position::from_gp0(self.gp0_command[3]),
            Position::from_gp0(self.gp0_command[4]),
            ];

        // Only one color repeated 4 times
        let colors = [ Color::from_gp0(self.gp0_command[0]); 4];

        self.renderer.push_quad(positions, colors);
    }

   /// GP0(0x38): Shaded Opaque Quadrilateral
    fn gp0_quad_shaded_opaque(&mut self) {
        let positions = [
            Position::from_gp0(self.gp0_command[1]),
            Position::from_gp0(self.gp0_command[3]),
            Position::from_gp0(self.gp0_command[5]),
            Position::from_gp0(self.gp0_command[7]),
            ];

        let colors = [
            Color::from_gp0(self.gp0_command[0]),
            Color::from_gp0(self.gp0_command[2]),
            Color::from_gp0(self.gp0_command[4]),
            Color::from_gp0(self.gp0_command[6]),
            ];

        self.renderer.push_quad(positions, colors);
    }
}

Even though we use per-vertex colors it's easy to draw monochrome primitives by repeating the same color. We have encountered a third quad command, gp0_quad_texture_blend_opaque but since we don't support textures we can't implement that correctly yet. In the meantime we can use a solid color instead, it won't look right but at least we'll see something:

impl Gpu {
    //...

    /// GP0(0x2C): Textured Opaque Quadrilateral
    fn gp0_quad_texture_blend_opaque(&mut self) {
        let positions = [
            Position::from_gp0(self.gp0_command[1]),
            Position::from_gp0(self.gp0_command[3]),
            Position::from_gp0(self.gp0_command[5]),
            Position::from_gp0(self.gp0_command[7]),
            ];

        // XXX We don't support textures for now, use a solid red
        // color instead
        let colors = [ Color(0x80, 0x00, 0x00); 4];

        self.renderer.push_quad(positions, colors);
    }
}

Playstation boot logo without textures{width="\textwidth"}

[[fig:bootlogo_notextures]]{#fig:bootlogo_notextures label="fig:bootlogo_notextures"}

Lo and behold, we should now have something that looks very much like the "Sony Computer Entertainment" boot logo, minus the text which is contained in the textures. Figure [fig:bootlogo_notextures]{reference-type="ref" reference="fig:bootlogo_notextures"} shows the expected output.

Playstation boot logo with bad quad rendering{width="\textwidth"}

[[fig:bootlogo_badquad]]{#fig:bootlogo_badquad label="fig:bootlogo_badquad"}

As before the black area at the right and bottom of the image is due to the fact that we display the entire framebuffer instead of just the part configured in the video output. You can see that a single 640x480 image already takes more than half of the entire VRAM and we're only displaying a very simple logo. Game developers back then had to be very careful with VRAM usage (and memory usage in general). This is also one of the reasons most games are rendered at lower resolutions like 640x240, but we'll see that later.

Note that there are two ways to split a quadrilateral in two triangles by cutting along either diagonal. The choice is significant, figure [fig:bootlogo_badquad]{reference-type="ref" reference="fig:bootlogo_badquad"} shows the result of splitting across the other diagonal40. You can see that the main "tilted square" behind the two triangles is shaded differently. If your emulator's output looks like this it means that you're not rendering the quads in the right order, you need to split along the other diagonal.

Draw Offset emulation

Our OpenGL renderer is very basic but we can at least add the draw offset easily. Of course the most obvious way would be to add it to the Positions before we put them in the attribute buffer but instead we can have the vertex shader do it for us!

In order to do this we can declare an "uniform" in the shader code:

//...

// Drawing offset
uniform ivec2 offset;

void main() {
  ivec2 position = vertex_position + offset;

  // Convert VRAM coordinates (0;1023, 0;511) into
  // OpenGL coordinates (-1;1, -1;1)
  float xpos = (float(position.x) / 512) - 1.0;
  // VRAM puts 0 at the top, OpenGL at the bottom,
  // we must mirror vertically
  float ypos = 1.0 - (float(position.y) / 256);

  gl_Position.xyzw = vec4(xpos, ypos, 0.0, 1.0);

  //...
}

Uniforms are inputs that are shared across all the intances of the shader. So instead of having an offset vertex attribute with one entry per vertex we can have a single variable that will be used for an entire batch of primitives.

To be able to modify the value of the uniform from our code we must retreive the index like we did for the vertex attributes. We can then set its value using Uniform2i41:

impl Renderer {
    //...

    /// Index of the "offset" shader uniform
    uniform_offset: GLint,
}

impl Renderer {

    pub fn new() -> Renderer {
        //...

        // Retreive and initialize the draw offset
        let uniform_offset = find_program_uniform(program,
                                                 "offset");
        unsafe {
            gl::Uniform2i(uniform_offset, 0, 0);
        }

        Renderer {
            //...

            uniform_offset: uniform_offset,
        }
    }

    //...
}

We can now add a method to set the value of the uniform. We need to be careful to draw the currently buffered primitives before we change the offset since those were supposed to be drawn with the previous value and might end up located at the wrong place:

impl Renderer {
    //...

    /// Set the value of the uniform draw offset
    pub fn set_draw_offset(&mut self, x: i16, y: i16) {
        // Force draw for the primitives with the current offset
        self.draw();

        // Update the uniform value
        unsafe {
            gl::Uniform2i(self.uniform_offset,
                          x as GLint,
                          y as GLint);
        }
    }
}

Finally we can get rid of our drawing_x_offset and drawing_y_offset member variables in the GPU and call set_draw_offset directly instead.

The fact that we have to force a partial draw every time the offset is changed means that in pathological cases this might end up being slower. For instance if a game draws thounsands of triangles, changing the offset between each one, we'll issue thousands of partial draw commands. In this case it would probably be faster to simply add the offset before we push the Positions in the attribute buffer.

Handling SDL2 events and exiting cleanly

Before we move on I want to fix one annoying problem introduced by our brand new SDL2 window: since we don't handle SDL events we can't exit the emulator cleanly. And since SDL2 catches SIGINT by default we can't even interrupt our emulator with ^C anymore.

Fortunately it's an easy fix: instead of initializing the SDL context in the Renderer we move it all the way up in the main function and then check for events in the top level loop. Since we need the SDL context to create the window we also have to shuffle constructors a bit: I've decided to create the renderer in the main function then move it into the Gpu constructor which is itself moved into the Interconnect constructor:

use sdl2::event::Event;
use sdl2::keycode::KeyCode;

fn main() {
    let bios = Bios::new(&Path::new("roms/SCPH1001.BIN")).unwrap();

    // We must initialize SDL before the interconnect is created since
    // it contains the GPU and the GPU needs to create a window
    let sdl_context = sdl2::init(::sdl2::INIT_VIDEO).unwrap();

    let renderer = Renderer::new(&sdl_context);
    let gpu = Gpu::new(renderer);
    let inter = Interconnect::new(bios, gpu);
    let mut cpu = Cpu::new(inter);

    let mut event_pump = sdl_context.event_pump();

    loop {
        for _ in 0..1_000_000 {
            cpu.run_next_instruction();
        }

        // See if we should quit
        for e in event_pump.poll_iter() {
            match e {
                Event::KeyDown { keycode: KeyCode::Escape, .. } => return,
                Event::Quit {..} => return,
                _ => (),
            }
        }
    }
}

When the Quit event is encountered (window closes, received SIGINT etc...) we return from main, effectively exiting the program. For convenience I also quit when the Escape key is pressed in the window.

The inner for loop is needed because checking for events before every instruction slows everything down very significantly so I only check once for every million instruction executed.

The Interconnect: Generic loads and stores

At this point we have three load and three store methods in our interconnect to deal with byte, halfword and word accesses. Those implementations look very similar to each other.

When we implement the debugger and the timings you'll see that we'll need special versions of those methods. At this rate we'll end up with dozens of memory access functions that will be very similar but for a few key differences.

This is a lot of potential code duplication. Fortunately we can avoid most of it by making our code use generics instead of having different flavors for 8, 16 and 32bit loads and store.

The first step is to create a generic Addressable trait:

/// Types of access supported by the Playstation architecture
#[derive(PartialEq,Eq,Debug)]
pub enum AccessWidth {
    Byte = 1,
    Halfword = 2,
    Word = 4,
}

/// Trait representing the attributes of a primitive addressable
/// memory location.
pub trait Addressable {
    /// Retreive the width of the access
    fn width() -> AccessWidth;
    /// Build an Addressable value from an u32. If the Addressable is 8
    /// or 16bits wide the MSBs are discarded to fit.
    fn from_u32(u32) -> Self;
    /// Retreive the value of the Addressable as an u32. If the
    /// Addressable is 8 or 16bits wide the MSBs are padded with 0s.
    fn as_u32(self) -> u32;
}

We can then implement this trait for u8, u16 and u32:

impl Addressable for u8 {
    fn width() -> AccessWidth {
        AccessWidth::Byte
    }

    fn from_u32(v: u32) -> u8 {
        v as u8
    }

    fn as_u32(&self) -> u32 {
        *self as u32
    }
}

impl Addressable for u16 {
    fn width() -> AccessWidth {
        AccessWidth::Halfword
    }

    fn from_u32(v: u32) -> u16 {
        v as u16
    }

    fn as_u32(&self) -> u32 {
        *self as u32
    }
}

impl Addressable for u32 {
    fn width() -> AccessWidth {
        AccessWidth::Word
    }

    fn from_u32(v: u32) -> u32 {
        v
    }

    fn as_u32(&self) -> u32 {
        *self
    }
}

Porting the CPU code

We can now factor our various memory access functions by making them generic over this Addressable trait. On the CPU it looks like this:

impl Cpu {
    //...    

    /// Memory read
    fn load<T: Addressable>(&self, addr: u32) -> T {
        self.inter.load(addr)
    }
    
    /// Memory write
    fn store<T: Addressable>(&mut self, addr: u32, val: T) {
        if self.sr & 0x10000 != 0 {
            // Cache is isolated, ignore write
            println!("Ignoring store while cache is isolated");
            return;
        }

        self.inter.store(addr, val);
    }
}

We can then replace the various load* and store* functions used in the CPU code by the generic versions. Most of the time the compiler can't infer the type properly (since we're casting all over the place to get the correct width and sign extension) so we have to explicitly tell it which of u8, u16 or u32 to use. For instance our LB implementation becomes:

impl Cpu {
    //...    

    /// Load Byte (signed)
    fn op_lb(&mut self,
             instruction: Instruction,
             debugger: &mut Debugger) {

        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);

        // Cast as i8 to force sign extension
        let v = self.load::<u8>(addr, debugger) as i8;

        // Put the load in the delay slot
        self.load = (t, v as u32);
    }
}

Porting the interconnect code

Then we need to port our interconnect code to use the generic interface. We have to merge the various load and store function in a single generic one. First the load function:

impl Interconnect {
    //...

    /// Interconnect: load value at `addr`
    pub fn load<T: Addressable>(&self, addr: u32) -> T {
        let abs_addr = map::mask_region(addr);

        if let Some(offset) = map::RAM.contains(abs_addr) {
            return self.ram.load(offset);
        }

        if let Some(offset) = map::BIOS.contains(abs_addr) {
            return self.bios.load(offset);
        }

        if let Some(offset) = map::IRQ_CONTROL.contains(abs_addr) {
            println!("IRQ control read {:x}", offset);
            return Addressable::from_u32(0);
        }

        if let Some(offset) = map::DMA.contains(abs_addr) {
            return self.dma_reg(offset);
        }

        if let Some(offset) = map::GPU.contains(abs_addr) {
            return self.gpu.load(offset);
        }

        if let Some(offset) = map::TIMERS.contains(abs_addr) {
            println!("Unhandled read from timer register {:x}",
                     offset);
            return Addressable::from_u32(0);
        }

        if let Some(_) = map::SPU.contains(abs_addr) {
            println!("Unhandled read from SPU register {:08x}",
                     abs_addr);
            return Addressable::from_u32(0);
        }

        if let Some(_) = map::EXPANSION_1.contains(abs_addr) {
            // No expansion implemented. Returns full ones when no
            // expansion is present
            return Addressable::from_u32(!0);
        }

        panic!("unhandled load at address {:08x}", addr);
    }
}

You can see that the Addressable::from_u32 function can be used to return a literal value without having to know the real type being used.

The store function is pretty straightforward:

impl Interconnect {
    //...

    /// Interconnect: store `val` into `addr`
    pub fn store<T: Addressable>(&mut self, addr: u32, val: T) {

        let abs_addr = map::mask_region(addr);

        if let Some(offset) = map::RAM.contains(abs_addr) {
            return self.ram.store(offset, val);
        }

        if let Some(offset) = map::IRQ_CONTROL.contains(abs_addr) {
            println!("IRQ control: {:x} <- {:08x}", offset,
                     val.as_u32());
            return;
        }

        if let Some(offset) = map::DMA.contains(abs_addr) {
            return self.set_dma_reg(offset, val);
        }

        if let Some(offset) = map::GPU.contains(abs_addr) {
            return self.gpu.store(offset, val);
        }

        if let Some(offset) = map::TIMERS.contains(abs_addr) {
            println!("Unhandled write to timer register");
            return;
        }

        if let Some(_) = map::SPU.contains(abs_addr) {
            println!("Unhandled write to SPU register");
            return;
        }

        if let Some(_) = map::CACHE_CONTROL.contains(abs_addr) {
            println!("Unhandled write to CACHE_CONTROL");
            return;
        }

        if let Some(offset) = map::MEM_CONTROL.contains(abs_addr) {
            match offset {
                0 => // Expansion 1 base address
                    if val != 0x1f000000 {
                        panic!("Bad expansion 1 base address");
                    },
                4 => // Expansion 2 base address
                    if val != 0x1f802000 {
                        panic!("Bad expansion 2 base address");
                    },
                _ =>
              println!("Unhandled write to MEM_CONTROL register"),
            }

            return;
        }

        if let Some(_) = map::RAM_SIZE.contains(abs_addr) {
            // We ignore writes at this address
            return;
        }

        if let Some(offset) = map::EXPANSION_2.contains(abs_addr) {
            println!("Unhandled write to expansion 2 register");
            return;
        }

        panic!("unhandled store into address {:08x}: {:08x}",
               addr, val.as_u32());
    }
}

Porting the RAM and BIOS

For the RAM we need to know how many bytes must be loaded or stored. We can use the Addressable::width method to figure it out:

impl Ram {
    //...

    /// Fetch the little endian value at `offset`
    pub fn load<T: Addressable>(&self, offset: u32) -> T {
        let offset = offset as usize;

        let mut v = 0;

        for i in 0..T::width() as usize {
            v |= (self.data[offset + i] as u32) << (i * 8)
        }

        Addressable::from_u32(v)
    }

    /// Store the 32bit little endian word `val` into `offset`
    pub fn store<T: Addressable>(&mut self, offset: u32, val: T) {
        let offset = offset as usize;

        let val = val.as_u32();

        for i in 0..T::width() as usize {
            self.data[offset + i] = (val >> (i * 8)) as u8;
        }
    }
}

The BIOS doesn't have a store method since it's read-only and we can reuse the RAM's load code without any change.

This looping and bit fiddling might seem a little under-optimized but LLVM seems to handle it well and generates code which looks almost exactly like the previous non-generic version. And we have less code duplication, so all is good.

Porting the GPU code

For the GPU I'll be a little more conservative: at this point I'm not sure how it behaves when we don't use 32bit for register reads and writes. For this reason I'll still just support 32bit access by checking what kind of generic I'm using:

impl Gpu {
    //...

    pub fn load<T: Addressable>(&self, offset: u32) -> T {

        if T::width() != AccessWidth::Word {
            panic!("Unhandled {:?} GPU load", T::width());
        }

        let r =
            match offset {
                0 => self.read(),
                4 => self.status(),
                _ => unreachable!(),
            };

        Addressable::from_u32(r)
    }

    pub fn store<T: Addressable>(&mut self, offset: u32, val: T) {

        if T::width() != AccessWidth::Word {
            panic!("Unhandled {:?} GPU load", T::width());
        }

        let val = val.as_u32();

        match offset {
            0 => self.gp0(val),
            4 => self.gp1(val),
            _ => unreachable!(),
        }
    }
}

Porting the DMA code

Likewise we only support 32bit access on the DMA registers so we can modify the code to reflect that:

impl Interconnect {
    //...

    /// DMA register read
    fn dma_reg<T: Addressable>(&self, offset: u32) -> T {

        if T::width() != AccessWidth::Word {
            panic!("Unhandled {:?} DMA load", T::width());
        }

        //...

        Addressable::from_u32(res)
    }

    /// DMA register write
    fn set_dma_reg<T: Addressable>(&mut self, offset: u32, val: T) {
        if T::width() != AccessWidth::Word {
            panic!("Unhandled {:?} DMA store", T::width());
        }

        let val = val.as_u32();

        //...
    }
}

Now our code should build and behave exactly like it did before. On my system the performance is the performance is the same as far as I can tell. This more generic infrastructure will show its usefulness soon enough.

The Debugger: Breakpoints and Watchpoints

This section is of optional but having a good debugger can save us a lot of time later on. Being able to disassemble the code, set breakpoints or watchpoints or step through the assembly are invaluable tools when we need to understand why some emulated game doesn't behave properly in our emulator.

Writing a good debugger frontend can be quite some work however. For simplicity's sake I've decided to implement the GDB remote protocol over a local TCP socket. This way I can just implement the low level debugging code in the emulator and I use a general purpose GDB binary targeting the MIPS architecture as a frontend. Then I can debug Playstation code almost like any program using GDB, I can disassemble the code, dump the data etc.... If I run code that I build myself I can even provide it with debugging symbols and step through functions and other high level niceties.

You might prefer to design the frontend yourself and integrate it directly in the emulator. It's more work but you may add Playstation-specific features more easily (GPU debugging comes to mind). For this reason I'm just going to describe the low level debugging interface in this guide, you'll decide what kind of frontend you want to build on top.

Debugger memory access {#sec:debug-mem-access}

This part is easy, we already have the generic load and store functions in our Cpu that we can use to access the memory. We can simply pass a reference to our Cpu in the debugger code and use that directly.

One potential issue with this approach is that loads and stores may have unintended side-effects when used from the debugger. For instance if we read from the GPUREAD register (when we properly implement it) we "pop" a word from the read buffer and it'll become unavailable when the real Playstation code wants to read it.

Later on when we implement the timings even reading from from regular RAM will take a few emulated CPU cycles which will effectively "waste" some time for the emulated code and might result in a missed interrupt or something similar.

Fortunately now that we have our generic load and store implementations we'll only have to implement a specialized version of those two functions if the side-effects become unmanageable in debugging code. Those specialized functions could ignore regular timings and even call specialized code in the various peripherals to prevent any state change.

For the time being I'll just call the regular load and store functions since we don't emulate enough side-effects to make a significant difference anyway. That might change as me become more accurate.

Breakpoints

Breakpoints are triggered when a certain instruction gets executed. The instruction is identified by its memory address. We can store the breakpoint addresses in a vector:

pub struct Debugger {
    /// Vector containing all active breakpoint addresses
    breakpoints: Vec<u32>,
}

We then need a pair of function for adding a deleting a breakpoint. It's a good idea to make sure we can't insert the same address twice: insertions and deletions are going to be rare while the address lookup will have to happen for every instruction so we want to keep the list as small as possible:

impl Debugger {
    /// Add a breakpoint that will trigger when the instruction at
    /// `addr` is about to be executed.
    fn add_breakpoint(&mut self, addr: u32) {
        // Make sure we're not adding the same address twice
        if !self.breakpoints.contains(&addr) {
            self.breakpoints.push(addr);
        }
    }

    /// Delete breakpoint at `addr`. Does nothing if there was no
    /// breakpoint set for this address.
    fn del_breakpoint(&mut self, addr: u32) {
        self.breakpoints.retain(|&a| a != addr);
    }
}

Finally we can implement the method pc_change that will be called before every instruction to look for a breakpoint at the current address. Needless to say this code is in a very critical path and must be as fast as possible:

impl Debugger {
    //...

    /// Called by the CPU when it's about to execute a new
    /// instruction. This function is called before *all* CPU
    /// instructions so it needs to be as fast as possible.
    pub fn pc_change(&mut self, cpu: &mut Cpu) {
        if self.breakpoints.contains(&cpu.pc()) {
            self.debug(cpu);
        }
    }
}

The debug method is where the debugging frontend should be notified that the execution stopped and wait for the user to resume the execution.

Using a vector to store the breakpoints might seem sub-optimal since it has linear lookup time. A tree-based collection could theoritically work in logarithmic time. We have to consider two things however: we want to optimize for the common case where no debugging is taking place and no breakpoint is set and even when we're debugging we probably won't be using thousands of breakpoints simultaneously.

Iterating over an empty vector should be very cheap: a simple test of the length of the vector and we exit the loop immediately. And even for small non-empty vectors it will probably be faster than a more complex structure (strong cache locality, no cache thrashing, no indirections, easy prefetching).

For these reasons I don't think it's necessary to bother using anything more complicated than a good old vector, the constant cost probably matters more than the linear complexity for our usage.

Finally we can plug pc_change in our CPU:

impl Cpu {
    //...

    /// Run a single CPU instruction and return
    pub fn run_next_instruction(&mut self, debugger: &mut Debugger) {
        // Synchronize the peripherals
        self.inter.sync(&mut self.tk);

        // Save the address of the current instruction to save in
        // `EPC` in case of an exception.
        self.current_pc = self.pc;

        // Debugger entrypoint: used for code breakpoints and stepping
        debugger.pc_change(self);

        //...
    }

    pub fn pc(&self) -> u32 {
        self.pc
    }
}

I pass the debugger object from the main function in order to be able to start a debugging session at the press of a key:

fn main() {
    //...

    let mut debugger = Debugger::new();

    let mut event_pump = sdl_context.event_pump();

    loop {
        for _ in 0..1_000_000 {
            cpu.run_next_instruction(&mut debugger);
        }

        // See if we should quit
        for e in event_pump.poll_iter() {
            match e {
                Event::KeyDown { keycode: KeyCode::Pause, .. } =>
                    debugger.debug(&mut cpu),
                Event::KeyDown { keycode: KeyCode::Escape, .. } =>
                    return,
                Event::Quit {..} => return,
                _ => (),
            }
        }
    }
}

In a quick benchmark this debugging code causes a small (but noticeable) degradation of the performances. I think it'll probably end up being well worth it. We could make the compilation of the debugging code optional to make it possible to have faster binaries when we don't want debugging but we never know when we might need it anyway and having several build configurations makes the code harder to test and could lead to code rot. The debugger could also potentially be used for cheating in games so it might make sense to leave it enabled even for "end user" builds.

Watchpoints

Being able to break on a specific instruction is useful but sometimes we want to know when a certain location in memory is loaded or modified. In order to do that we can implement read and write watchpoints that will respectively check each load and store address and trigger the debugger when a watched address is encountered.

As for breakpoints we'll store the watchpoint addresses in vectors:

pub struct Debugger {
    /// Vector containing all active read watchpoints
    read_watchpoints: Vec<u32>,
    /// Vector containing all active write watchpoints
    write_watchpoints: Vec<u32>,
}

The methods for adding, removing and testing the watchpoints will therefore look very similar to the breakpoint implementation:

impl Debugger {
    //...

    /// Add a breakpoint that will trigger when the CPU attempts to
    /// read from `addr`
    fn add_read_watchpoint(&mut self, addr: u32) {
        // Make sure we're not adding the same address twice
        if !self.read_watchpoints.contains(&addr) {
            self.read_watchpoints.push(addr);
        }
    }

    /// Delete read watchpoint at `addr`. Does nothing if there was no
    /// breakpoint set for this address.
    fn del_read_watchpoint(&mut self, addr: u32) {
        self.read_watchpoints.retain(|&a| a != addr);
    }

    /// Called by the CPU when it's about to load a value from memory.
    pub fn memory_read(&mut self, cpu: &mut Cpu, addr: u32) {
        // XXX: how should we handle unaligned watchpoints? For
        // instance if we have a watchpoint on address 1 and the CPU
        // executes a `load32 at` address 0, should we break? Also,
        // should we mask the region?
        if self.read_watchpoints.contains(&addr) {
            println!("Read watchpoint triggered at 0x{:08x}", addr);
            self.debug(cpu);
        }
    }

    /// Add a breakpoint that will trigger when the CPU attempts to
    /// write to `addr`
    fn add_write_watchpoint(&mut self, addr: u32) {
        // Make sure we're not adding the same address twice
        if !self.write_watchpoints.contains(&addr) {
            self.write_watchpoints.push(addr);
        }
    }

    /// Delete write watchpoint at `addr`. Does nothing if there was no
    /// breakpoint set for this address.
    fn del_write_watchpoint(&mut self, addr: u32) {
        self.write_watchpoints.retain(|&a| a != addr);
    }

    /// Called by the CPU when it's about to load a value from memory.
    pub fn memory_write(&mut self, cpu: &mut Cpu, addr: u32) {
        // XXX: same remark as memory_read for unaligned stores
        if self.write_watchpoints.contains(&addr) {
            println!("Write watchpoint triggered at 0x{:08x}", addr);
            self.debug(cpu);
        }
    }
}

You can see that I put a few comments about unaligned access and regions, I'm not entirely sure what's the right thing to do here. I guess we'll see how we want the debugger to behave as we're using it.

Now we just have to plug the memory read and write methods in our generic load and store functions in the CPU:

impl Cpu {
    //...

    /// Memory read
    fn load<T: Addressable>(&mut self,
                            addr: u32,
                            debugger: &mut Debugger) -> T {
        debugger.memory_read(self, addr);

        self.inter.load(&mut self.tk, addr)
    }

    /// Memory write
    fn store<T: Addressable>(&mut self,
                             addr: u32,
                             val: T,
                             debugger: &mut Debugger) {
        debugger.memory_write(self, addr);

        if self.sr.cache_isolated() {
            self.cache_maintenance(addr, val);
        } else {
            self.inter.store(&mut self.tk, addr, val);
        }
    }
}

We've added an additional debugger parameter to these two methods so we have to pass the debugger reference from run_next_instruction to decode_and_execute and finally to the various load and store methods that need to do memory access (op_sw, op_lw, op_swr, etc...).

There are two issues with this implementation however. First we use this store method to fetch instructions but we don't want to trigger a read watchpoint when we're loading instructions (that's what breakpoints are for). The fix is easy, we just call the interconnect's load method directly:

impl Cpu {
    //...

    /// Run a single CPU instruction and return
    pub fn run_next_instruction(&mut self, debugger: &mut Debugger) {
        //...

        // Fetch instruction at PC
        let pc = self.current_pc;
        let instruction = Instruction(self.inter.load(pc));

        //...
    }
}

An other problem is that you might be using this CPU load method in your debugger to read the memory's contents. Obviously you don't want to recursively trigger the debugger when you use it to read some memory location where a watchpoint happens to live. Instead we can create an other method used for loading data for debugging purposes42. I named this method examine:

impl Cpu {
    //...

    /// Debugger memory read
    pub fn examine<T: Addressable>(&mut self, addr: u32) -> T {
        self.inter.load(&mut self.tk, addr)
    }
}

Code disassembly and beyond

I didn't show any disassembler code since GDB does it for me but it shouldn't be too difficult to implement since MIPS instructions are fixed width. Just read the code you want to disassemble 32bits at a time and then implement something similar to our decode_and_execute method but instead of executing the instruction you return the disassembled code in a string for instance.

If you want to be fancy and support MISP assembler pseudo-instructions you'll have to handle certain instructions specifically, for instance could be displayed as nop while should be . Of course it's still correct if you keep the real instructions instead of the assembler shorthand but it's generally more readable if you use the later.

Later on we'll have to consider adding debugging for the GPU as well (displaying primitives, textures, exploring linked lists etc...).

The CPU: Instruction cache

Before we move on to the GPU timings let's start by implementing the CPU instruction cache. Without it we won't be able to emulate the CPU speed properly since cached code gets executed much faster than instructions that have to be fetched from RAM. The CPU also has a data cache but it's not used as a proper cache so we can leave that for later.

Instruction cache lookup behavior

The Playstation's CPU has a 4KB instruction cache that can contain up to 1024 instructions across 256 4-instruction cachelines. The cache is directly mapped which means that there's only one possible cacheline for any given memory address.

Here's how it works: each cacheline contains enough room for 4 instructions plus a tag and valid bits. The tag contains the upper 20bits of the physical address being cached, it is used to make sure we're really getting the data from the correct memory location and not some other address that happens to alias the cacheline. Then for each instruction in the cacheline a bit says if the entry is valid or not. When fetching an instruction if the tag is mismatched or the entry is not valid it'll have to be fetched from main memory, otherwise we can directly use the cached value.

Tag [31:12] Cacheline [11:4] Index [3:2] Word alignment [1:0]


 `0x80005`           `0x38`              `1`                  0

: Anatomy of cached address 0x80005384[]{label="tab:cached-address"}

Let's take a concrete example shown in table [tab:cached-address]{reference-type="ref" reference="tab:cached-address"}. Suppose the CPU wants to run code from address 0x80005384. First we need to figure out which cacheline matches this address, for that we need to shift the address two bits to the right (since we have 4 32bit words per cache line) and then take the 8 LSBs (since we have 256 cachelines in total). In this case we end up in cacheline number 56 (0x38).

Now that we have identified the cacheline we need to see if it already contains data for the current address, after all any address ending in 0x38X will match the same cache location. In order to do that we compare the tag stored in the line with bits [31:12] of the instruction address, in this case 0x80005. If the tag doesn't match we consider it invalid and we have to fetch it from RAM.

If the tag is the one we're looking for however we just have to check the valid bit for the instruction we're looking for. Bits [3:2] give us the location in the 4-word cacheline, bits [1:0] are always 0 since all instructions are word-aligned43 so in this case we're looking for the 2nd word in the cacheline. If the valid bit is set we can use it directly, otherwise the instruction is invalid and we must fetch it from main RAM.

Instruction cache fetch behavior

When an invalid instruction is encountered (either because the line has the wrong tag or the valid bit is not set) the Playstation cache will will update the tag to match the current address and then fetch the missing instruction as well as any following instruction in the same cacheline, but not the one before it.

For instance in the case of address 0x80005384 if we have a cache miss the instructions at addresses 0x80005384, 0x80005388 and 0x8000538c will be fetched (words at indexes 1, 2 and 3 respectively) but not 0x80005380 (the word at index 0). I suppose that if some of the following instructions are already valid they're not fetched again but I haven't tested it and it's probably not very common anyway.

Footnotes

  1. I'm using the GNU assembler syntax in this guide unless otherwise noted.

  2. MIPS assemblers actually feature a nop pseudo-instruction that generates this instruction.

  3. One exception would be memory loads which can have side effects even if the value is discarded in $zero.

  4. I'll skip the code in decode_and_execute from now on, I'm sure you can figure it out by yourself...

  5. For instance to an other region as we'll see later.

  6. I'm assuming that the assembler is not asked to reorder the instructions. To get this behaviour you have to use ".set noreorder" with the GNU assembler.

  7. This is actually pseudo-assembly for the sake of clarity. The correct GNU assembler syntax would be but it's a bit too ambiguous for my taste.

  8. It is called a branch delay slot after all...

  9. If you're implementing this in C or C++ and need to check for signed overflow yourself you'll find plenty of examples online. Welcome to the 1970s. Be careful with your implementation though because signed integer overflow is undefined behaviour in C.

  10. This behaviour is part of the MIPS I architecture. Later revisions (starting with MIPS II) don't have load delay slots, only branch delay slots.

  11. As I mentioned earlier MOVE is actually a pseudo-instruction that the assembler will expand into an addu $<target>, $<source>, $zero.

  12. Having separate functions for various width should make the code easier to follow for now but it does create some code duplication, later on I'll use generics to factor them in a single function.

  13. Note the use of a negative offset, if we hadn't implemented proper sign extension earlier this instruction would misbehave.

  14. Note the cast from u8 to i8 and finally u32 to force the sign extension.

  15. I'm actually not sure how to test that easily since I need to have an expansion plugged in the parallel connector to be able to run code on my console. Maybe I could start the code and unplug it but that doesn't sound too great... A better way would be to burn the test code on a CD and run it on a modchipped console.

  16. I'm using peudo-assembly again. The proper GNU assembler syntax would be mfc0 $2, $12

  17. IRQ is a common abbreviation for "Interrupt Request".

  18. Note that the immediate is still sign extended even though it's then used as an unsigned value.

  19. This is not entirely accurate when the exception occurs in a branch delay slot. We'll review that case in a minute

  20. This is only for branch delay slots, load delay slots behave normally exception-wise.

  21. It might be more efficient to add the test in the branch and jump instructions capable of setting an invalid PC but I don't really care about performance at that point and that would make the code more complicated

  22. Although on the Playstation the CPU is seriously gimped while the DMA is running as we'll see later

  23. Note that in this context ! in rust does the same thing as ~ in C: it's the bitwise NOT operator.

  24. Well, four really: two read only and two write only sharing the same addresses.

  25. Interesting bit of trivia: apparently the LWL and LWR instructions were patented. The patent expired in 2006 and some people claimed that it might also have covered software implementations. If that's true it means one could not have distributed our emulator without a license from MIPS Computer Systems.

  26. You'll notice that I split the register in individual variables, I prefer to do that when know I'll have to manipulate the fields individually. It makes the code clearer and less error prone in my experience. It has a small cost however: it takes up a little more memory and we have to pack/unpack them when handling registers read/writes.

  27. From now on I'm not going to bother putting the glue code in the interconnect here when it's straightforward. If you're having doubts you can look up the source code of the emulator in the repository.

  28. The Nocash docs speculate that bit 29 might be used to pause an ongoing transfer but that will require some more testing.

  29. And it's a good thing since at that point no start address has been set!

  30. I'll describe those GPU instructions in greater details later when we'll implement them.

  31. I've tried quickly disassembling the surrounding code but I couldn't really figure out what it's trying to do. I'll have to take the time to dig deeper at some point...

  32. The reason is that Rust obviously doesn't have a 11bit signed integer type so we have to shift to 16bits in order to get the correct sign in an i16, then we can shift back to 11bits.

  33. Those coordinates are not in VRAM but rather in the output's video signal system of coordinates.

  34. At the time of writing OpenGL 4.5 is the latest version but 3.3 is more widely supported and should suffice for what we're doing although we may end up using a couple extensions.

  35. The OpenGL C API concatenates the "gl" prefix to symbols ("GL_" for macros) so in C ClearColor would be glClearColor and COLOR_BUFFER_BIT would be GL_COLOR_BUFFER_BIT. When searching for an OpenGL symbol online it's sometimes better to use the C form.

  36. If you're not familiar with homogeneous coordinates don't worry, all you have to know for now is that you have to set the w component to 1.0 for a position and 0.0 for a vector.

  37. There are actually a couple more stages before that in modern OpenGL like the tesselation and geometry shaders but we don't need to bother with that.

  38. I'm leaving out the definition of the various Debug* types which are just thin wrappers around the OpenGL values, as always check the repository if you want to see the entire code.

  39. Although you can emulate proper quad shading in shaders if you really need to.

  40. I modified push_quad: instead of rendering triangles with vertex indexes [0, 1, 2] and [1, 2, 3] I used [2, 3, 0] and [3, 0, 1].

  41. The 2i part means that the function works on ivec2s, there are other Uniform* functions for the various other types.

  42. It can be used as a starting point for a "side-effect free" debugging path as I mentioned in section 7.1{reference-type="ref" reference="sec:debug-mem-access"}.

  43. Remember that we generate an exception if we ever end up with a misaligned PC so we can always assume that it's correctly aligned after that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment