Problem Description

Writes to high physical memory addresses use a special I/O bus instead of going through the normal cache hierarchy. While this is useful for relatively low-speed peripherals, it has performance limitations when used with coprocessors such as texture fetch units:

  • It can only transfer a single 32-bit word at a time. For vectorized compute code, this requires 32 instructions to copy a vector value into or out of it (getlane/write).
  • Every read or write incurs a rollback, since the I/O bus is shared by all cores and transactions must be arbitrated globally.
  • There is no way for a peripheral to make a thread wait until it is ready, so threads must poll, which wastes processor cycles and clogs up the bus.
  • The bus has no concept of thread IDs, so some form of arbitration is required at the software or coprocessor level.

The proposal is to create a new bus for use with high-speed peripherals. This would not replace the low-speed bus, but would address different use cases and constraints.

Implementation

The new bus would be specific to a single core and not shared globally by all cores. This would eliminate the overhead of arbitration among cores (although a peripheral could implement its own arbitration scheme). It would have the following interface:

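interface coprocessor_bus_interface;
    logic write_en;                     // Asserted by the core to start a write
    logic read_en;                      // Asserted by the core to start a read
    scalar_t address;
    thread_idx_t thread_idx;            // Thread issuing the access
    vector_t write_data;
    vector_t read_data;                 // Valid when ack is asserted for a read
    logic ack;                          // Peripheral was ready; thread continues
    local_thread_bitmap_t wake_bitmap;  // One bit per thread; wakes suspended threads

    modport master(output write_en, read_en, address, thread_idx, write_data,
        input read_data, ack, wake_bitmap);
    modport slave(input write_en, read_en, address, thread_idx, write_data,
        output read_data, ack, wake_bitmap);
endinterface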

On the cycle after a read or write is issued, the peripheral would either assert the ack signal or leave it deasserted.

  • If the ack signal is asserted, the peripheral was ready and the thread can continue executing without a rollback. If this was a read, the signal 'read_data' will contain data from the peripheral.
  • If the ack signal is deasserted, the pipeline will suspend the thread. The peripheral can later wake it by asserting the corresponding bit in 'wake_bitmap', which contains one bit per thread.
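
The core-side bookkeeping implied by the two cases above might look roughly like the following. This is only a sketch under stated assumptions: the module name, the 'suspended' output, and the NUM_THREADS parameter are hypothetical, plain ports stand in for the interface types, and a wake arriving in the same cycle as a missing ack is assumed to win so that no wakeup is lost.

```systemverilog
// Sketch only: tracks which threads are suspended waiting on a peripheral.
// Everything except the bus signal names (read_en, write_en, thread_idx,
// ack, wake_bitmap) is a hypothetical name chosen for illustration.
module suspend_tracker
    #(parameter NUM_THREADS = 4)
    (input                              clk,
    input                               reset,
    input                               read_en,
    input                               write_en,
    input [$clog2(NUM_THREADS) - 1:0]   thread_idx,
    input                               ack,
    input [NUM_THREADS - 1:0]           wake_bitmap,
    output logic [NUM_THREADS - 1:0]    suspended);

    // The ack arrives one cycle after the access, so remember who issued it.
    logic access_pending;
    logic [$clog2(NUM_THREADS) - 1:0] pending_thread;
    logic [NUM_THREADS - 1:0] suspend_oh;

    // One-hot mask of the thread to suspend this cycle (if any)
    always_comb begin
        suspend_oh = '0;
        if (access_pending && !ack)
            suspend_oh[pending_thread] = 1'b1;
    end

    always_ff @(posedge clk) begin
        if (reset) begin
            access_pending <= 1'b0;
            pending_thread <= '0;
            suspended <= '0;
        end else begin
            access_pending <= read_en || write_en;
            pending_thread <= thread_idx;

            // Assumption: wake takes priority over suspend in the same cycle.
            suspended <= (suspended | suspend_oh) & ~wake_bitmap;
        end
    end
endmodule
```

A thread scheduler could then simply exclude any thread whose bit is set in 'suspended' when picking the next instruction to issue.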

Peripherals can add FIFOs for writes and reads, only blocking threads when the FIFOs are full or empty.
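
As an illustration of the FIFO idea, here is a minimal sketch of a write-only peripheral that acks writes while its FIFO has space and wakes blocked threads after an entry drains. Several things here are assumptions rather than part of the proposal: the module and parameter names, the consumer-side 'pop'/'head_data'/'empty' ports, a 512-bit vector width, DEPTH being a power of two of at least 2, and the idea that a woken thread reissues its write.

```systemverilog
// Sketch only: write FIFO that blocks threads when full.
// Assumes DEPTH is a power of two >= 2 and that woken threads retry.
module fifo_write_peripheral
    #(parameter NUM_THREADS = 4,
    parameter DATA_WIDTH = 512,
    parameter DEPTH = 4)
    (input                              clk,
    input                               reset,
    // Bus side (signal names follow the interface above)
    input                               write_en,
    input [$clog2(NUM_THREADS) - 1:0]   thread_idx,
    input [DATA_WIDTH - 1:0]            write_data,
    output logic                        ack,
    output logic [NUM_THREADS - 1:0]    wake_bitmap,
    // Hypothetical consumer side
    input                               pop,
    output logic [DATA_WIDTH - 1:0]     head_data,
    output logic                        empty);

    logic [DATA_WIDTH - 1:0] entries[DEPTH];
    logic [$clog2(DEPTH) - 1:0] head, tail;
    logic [$clog2(DEPTH):0] count;
    logic [NUM_THREADS - 1:0] waiting, blocked_oh;
    logic full, do_push, do_pop;

    assign full = count == DEPTH;
    assign empty = count == 0;
    assign head_data = entries[head];
    assign do_push = write_en && !full;
    assign do_pop = pop && !empty;

    // One-hot mask of a thread whose write is rejected this cycle
    always_comb begin
        blocked_oh = '0;
        if (write_en && full)
            blocked_oh[thread_idx] = 1'b1;
    end

    always_ff @(posedge clk) begin
        if (reset) begin
            head <= '0;
            tail <= '0;
            count <= '0;
            waiting <= '0;
            ack <= 1'b0;
            wake_bitmap <= '0;
        end else begin
            // Registered, so ack is visible the cycle after the write
            ack <= do_push;
            if (do_push) begin
                entries[tail] <= write_data;
                tail <= tail + 1'b1;
            end
            if (do_pop)
                head <= head + 1'b1;
            count <= count + do_push - do_pop;

            if (do_pop) begin
                // Space freed: wake earlier waiters. A thread rejected in
                // this same cycle stays recorded and is woken on a later pop.
                wake_bitmap <= waiting;
                waiting <= blocked_oh;
            end else begin
                wake_bitmap <= '0;
                waiting <= waiting | blocked_oh;
            end
        end
    end
endmodule
```

This version wakes every waiting thread on each pop, which is simple but can cause some threads to be blocked again immediately; a peripheral could instead wake one thread per freed entry.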

Testing

  • Functional: need to create a dummy peripheral in the testbench.
    • Need a test that blocks on read/write and one that does not block
    • Test thread resume
  • Performance
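
One possible shape for the functional tests is sketched below: a dummy peripheral whose readiness is controlled by a testbench knob, plus a small self-checking testbench covering the blocking, resume, and non-blocking cases. The 'ready' knob, module names, thread count, and timing are all assumptions made for illustration; address and data signals are left out to keep the sketch short.

```systemverilog
// Hypothetical dummy peripheral: acks accesses only while 'ready' is high.
module dummy_peripheral
    #(parameter NUM_THREADS = 4)
    (input                              clk,
    input                               reset,
    input                               ready,  // testbench knob (assumption)
    input                               read_en,
    input                               write_en,
    input [$clog2(NUM_THREADS) - 1:0]   thread_idx,
    output logic                        ack,
    output logic [NUM_THREADS - 1:0]    wake_bitmap);

    logic [NUM_THREADS - 1:0] waiting, blocked_oh;

    always_comb begin
        blocked_oh = '0;
        if ((read_en || write_en) && !ready)
            blocked_oh[thread_idx] = 1'b1;
    end

    always_ff @(posedge clk) begin
        if (reset) begin
            ack <= 1'b0;
            wake_bitmap <= '0;
            waiting <= '0;
        end else begin
            ack <= (read_en || write_en) && ready;
            if (ready) begin
                wake_bitmap <= waiting;  // wake everyone who was blocked
                waiting <= '0;
            end else begin
                wake_bitmap <= '0;
                waiting <= waiting | blocked_oh;
            end
        end
    end
endmodule

module dummy_peripheral_tb;
    logic clk = 0, reset = 1, ready = 0, read_en = 0, write_en = 0;
    logic [1:0] thread_idx = 0;
    logic ack;
    logic [3:0] wake_bitmap;

    dummy_peripheral dut(.*);
    always #5 clk = ~clk;

    initial begin
        @(negedge clk) reset = 0;
        // Blocking case: peripheral not ready, thread 2 issues a read
        read_en = 1;
        thread_idx = 2;
        @(negedge clk) read_en = 0;
        assert(!ack) else $error("expected no ack while not ready");
        // Resume case: once ready, the peripheral should wake thread 2
        ready = 1;
        @(negedge clk);
        assert(wake_bitmap == 4'b0100) else $error("thread 2 not woken");
        // Non-blocking case: thread 1 writes while the peripheral is ready
        write_en = 1;
        thread_idx = 1;
        @(negedge clk) write_en = 0;
        assert(ack) else $error("expected ack while ready");
        $finish;
    end
endmodule
```

In a full testbench the stimulus would come from a program running on the core rather than from directly driven signals, so that thread suspension and resume in the pipeline are exercised as well.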
