@jbush001
Last active November 14, 2017 13:29
Problem Description

Accesses to high physical memory addresses bypass the normal cache hierarchy and use a special I/O bus instead. While this is useful for relatively low-speed peripherals, it has performance limitations when used with coprocessors such as texture fetch units:

  • It can only transfer a single 32-bit word at a time. For vectorized compute code, this requires 32 instructions (getlane/write) to copy a vector value into or out of a peripheral.
  • It takes a rollback for every read or write, since the I/O bus is shared by all cores and transactions must be arbitrated globally.
  • There is no way for peripherals to make a thread wait until they are ready, so threads must poll them, which wastes processor cycles and clogs up the bus.
  • The bus has no concept of thread IDs, so some form of arbitration is required at the software or coprocessor level.

The proposal is to create a new bus for high-speed peripherals. It would not replace the low-speed bus, but would address different use cases and constraints.

Implementation

The new bus would be specific to a single core and not shared globally by all cores. This would eliminate the overhead of arbitration among cores (although a peripheral could implement its own arbitration scheme). It would have the following interface:

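interface coprocessor_bus_interface;
    logic write_en;                     // Core issues a write this cycle
    logic read_en;                      // Core issues a read this cycle
    scalar_t address;                   // Target address of the access
    thread_idx_t thread_idx;            // Thread that issued the access
    vector_t write_data;                // Full vector of write data
    vector_t read_data;                 // Read result, valid when ack follows a read
    logic ack;                          // Set the cycle after an access if the peripheral was ready
    local_thread_bitmap_t wake_bitmap;  // One bit per thread; wakes suspended threads

    modport master(output write_en, read_en, address, thread_idx, write_data,
        input read_data, ack, wake_bitmap);
    modport slave(input write_en, read_en, address, thread_idx, write_data,
        output read_data, ack, wake_bitmap);
endinterface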

On the cycle after a read or write is issued, the peripheral would either assert the ack signal or leave it deasserted.

  • If the ack signal is asserted, the peripheral was ready and the thread can continue executing without a rollback. If this was a read, the signal 'read_data' will contain data from the peripheral.
  • If the ack signal is deasserted, the pipeline will suspend the thread. The peripheral can later wake it by asserting the corresponding signal in 'wake_bitmap', which contains one bit per thread.
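
The core-side bookkeeping implied by these two cases could look roughly like the sketch below. It is not part of the proposal: the module name, the 'suspended' output, and the NUM_THREADS parameter (with an assumed default of 4) are hypothetical, and plain ports stand in for the interface types so the block is self-contained.

```systemverilog
// Hypothetical sketch of core-side suspend tracking for the ack/wake
// handshake. Only write_en, read_en, thread_idx, ack, and wake_bitmap
// come from the proposed interface; everything else is an assumption.
module coproc_suspend_tracker
    #(parameter NUM_THREADS = 4)
    (input                              clk,
    input                               reset,
    input                               write_en,
    input                               read_en,
    input [$clog2(NUM_THREADS) - 1:0]   thread_idx,
    input                               ack,
    input [NUM_THREADS - 1:0]           wake_bitmap,
    output logic [NUM_THREADS - 1:0]    suspended);

    logic request_pending;
    logic [$clog2(NUM_THREADS) - 1:0] request_thread;
    logic [NUM_THREADS - 1:0] suspend_oh;

    // ack describes the access issued on the previous cycle, so remember
    // whether there was one and which thread issued it.
    always_ff @(posedge clk) begin
        if (reset) begin
            request_pending <= 1'b0;
            request_thread <= '0;
        end else begin
            request_pending <= write_en || read_en;
            request_thread <= thread_idx;
        end
    end

    // One-hot mask of the thread to suspend this cycle: there was an
    // access last cycle and the peripheral did not acknowledge it.
    always_comb begin
        suspend_oh = '0;
        if (request_pending && !ack)
            suspend_oh[request_thread] = 1'b1;
    end

    // Wake takes priority over suspend, so a wake that arrives in the
    // same cycle as a missing ack is never lost.
    always_ff @(posedge clk) begin
        if (reset)
            suspended <= '0;
        else
            suspended <= (suspended | suspend_oh) & ~wake_bitmap;
    end
endmodule
```

Giving wake priority is a design choice in this sketch: at worst the thread retries the access and is suspended again, whereas the opposite priority could leave a thread suspended with no further wake coming.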

Peripherals can add FIFOs for writes and reads, only blocking threads when the FIFOs are full or empty.
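
As an illustration of the FIFO idea, here is a minimal sketch of the write side of a peripheral, under several assumptions that are not in the proposal: the module and consumer-side port names are hypothetical, DATA_WIDTH defaults to an assumed 512 bits for a vector, DEPTH is a power of two of at least 2, and every waiting thread is woken whenever an entry is popped.

```systemverilog
// Hypothetical write-FIFO peripheral front end. Bus-side signals follow
// the proposed interface; parameters and consumer-side ports are assumed.
module write_fifo_peripheral
    #(parameter NUM_THREADS = 4,
    parameter DATA_WIDTH = 512,     // assumed vector width
    parameter DEPTH = 4)            // must be a power of two, >= 2
    (input                              clk,
    input                               reset,
    // Bus side
    input                               write_en,
    input [$clog2(NUM_THREADS) - 1:0]   thread_idx,
    input [DATA_WIDTH - 1:0]            write_data,
    output logic                        ack,
    output logic [NUM_THREADS - 1:0]    wake_bitmap,
    // Consumer side (the peripheral's internal datapath)
    input                               pop,
    output logic [DATA_WIDTH - 1:0]     head_data,
    output logic                        empty);

    logic [DATA_WIDTH - 1:0] entries[DEPTH];
    logic [$clog2(DEPTH) - 1:0] head, tail;
    logic [$clog2(DEPTH):0] count;
    logic [NUM_THREADS - 1:0] waiting;
    logic full, do_push, do_pop;

    assign full = count == DEPTH;
    assign empty = count == 0;
    assign do_push = write_en && !full;
    assign do_pop = pop && !empty;
    assign head_data = entries[head];

    always_ff @(posedge clk) begin
        if (reset) begin
            head <= '0;
            tail <= '0;
            count <= '0;
            waiting <= '0;
            ack <= 1'b0;
            wake_bitmap <= '0;
        end else begin
            // ack is registered, so it appears the cycle after the write.
            ack <= do_push;
            if (do_push) begin
                entries[tail] <= write_data;
                tail <= tail + 1'b1;
            end

            if (do_pop)
                head <= head + 1'b1;

            count <= count + do_push - do_pop;

            // Space was freed: wake every thread that was turned away.
            if (do_pop) begin
                wake_bitmap <= waiting;
                waiting <= '0;
            end else
                wake_bitmap <= '0;

            // A write that hits a full FIFO is not acknowledged; remember
            // the thread so it can be woken later. This assignment comes
            // last so it overrides the clear above for this thread's bit.
            if (write_en && full)
                waiting[thread_idx] <= 1'b1;
        end
    end
endmodule
```

Waking all waiters on each pop is the simplest policy but can cause several threads to retry for a single free slot; a real peripheral might wake only as many threads as there are free entries.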

Testing

  • Functional: need to create a dummy peripheral in the testbench.

    • Need a test that blocks on read/write and one that does not block
    • Test thread resume
  • Performance
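
The functional tests listed above might start from something like the following sketch. All of it is hypothetical: the dummy peripheral's 'ready' input is a testbench-only control that decides whether an access blocks, NUM_THREADS is assumed to be 4, and the address and data signals are omitted to keep the example short.

```systemverilog
// Hypothetical dummy peripheral: acknowledges accesses only while the
// testbench holds 'ready' high, and wakes blocked threads once it is.
module dummy_peripheral
    #(parameter NUM_THREADS = 4)
    (input                              clk,
    input                               reset,
    input                               ready,
    input                               write_en,
    input                               read_en,
    input [$clog2(NUM_THREADS) - 1:0]   thread_idx,
    output logic                        ack,
    output logic [NUM_THREADS - 1:0]    wake_bitmap);

    logic [NUM_THREADS - 1:0] waiting;
    logic request;

    assign request = write_en || read_en;

    always_ff @(posedge clk) begin
        if (reset) begin
            ack <= 1'b0;
            wake_bitmap <= '0;
            waiting <= '0;
        end else begin
            ack <= request && ready;
            if (ready) begin
                wake_bitmap <= waiting;     // release blocked threads
                waiting <= '0;
            end else begin
                wake_bitmap <= '0;
                if (request)
                    waiting[thread_idx] <= 1'b1;
            end
        end
    end
endmodule

// Self-checking testbench covering the three functional cases.
module test_coprocessor_bus;
    localparam NUM_THREADS = 4;

    logic clk = 0;
    logic reset = 1;
    logic ready = 1;
    logic write_en = 0;
    logic read_en = 0;
    logic [$clog2(NUM_THREADS) - 1:0] thread_idx = 0;
    logic ack;
    logic [NUM_THREADS - 1:0] wake_bitmap;

    dummy_peripheral #(.NUM_THREADS(NUM_THREADS)) dut(.*);

    always #5 clk = ~clk;

    // Issue a one-cycle write, then check ack on the following cycle.
    task automatic do_write(
        input logic [$clog2(NUM_THREADS) - 1:0] thread,
        input logic expect_ack);

        @(negedge clk);
        write_en = 1;
        thread_idx = thread;
        @(negedge clk);
        write_en = 0;
        assert (ack == expect_ack) else $fatal(1, "unexpected ack value");
    endtask

    initial begin
        repeat (2) @(negedge clk);
        reset = 0;

        // Non-blocking case: the peripheral is ready, so expect ack.
        do_write(0, 1'b1);

        // Blocking case: the peripheral is not ready, so expect no ack.
        ready = 0;
        do_write(1, 1'b0);

        // Resume: once ready again, thread 1's wake bit pulses for a cycle.
        ready = 1;
        @(negedge clk);
        assert (wake_bitmap == 4'b0010) else $fatal(1, "missing wake");
        @(negedge clk);
        assert (wake_bitmap == '0) else $fatal(1, "wake not cleared");

        $display("PASS");
        $finish;
    end
endmodule
```

A full test would drive these signals from the core's pipeline rather than directly from the testbench, and would also cover reads and several threads blocking at once.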
