
@rajesh-s
Created March 1, 2020 08:04

Delving into the why's of AXI

**Note: in everything below, "slave" can also mean "interconnect".**

  • Do we really need back-pressure?

    • Yes, you absolutely need backpressure. What happens when two masters want to access the same slave? One has to be blocked for some period of time. Some slaves may only be able to handle a limited number of concurrent operations and take some time to produce a result. As such, backpressure is required.
    • B and R channel backpressure is required in the case of contention towards the master. If a master makes burst read requests against two different slaves, one of them is going to have to wait.
      • Shouldn't a master be prepared to receive the responses for any requests it issues from the moment it makes the request? Aside from the clock crossing issue someone else brought up, and the interconnect issue at the heart of the use of IDs, why should an AXI master ever stall R or B channels?
        • The master should be prepared, but it only has one R and one B input, so it can't receive two responses at the same time, especially read bursts that can last many cycles.
    • To cover any and all situations, yes. Otherwise, there is no way for the master interface to know if the slave interface can handle the throughput. If you go for a clock conversion to a slower clock, for example, the clock converter needs a way to slow down the master. Back-pressure can also be used by the slave to wait for address AND data on writes to simplify the slave's design. Side note: on AXI4-Stream, back-pressure support is optional!
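
    A minimal Python model of the handshake may make this concrete. This is an illustrative sketch only (the `Channel` class and all its names are invented here, not part of any AXI spec): a beat transfers only on a cycle where both VALID and READY are high, and the sink applies back-pressure by deasserting READY when it cannot accept more data.

    ```python
    from collections import deque

    class Channel:
        """One AXI-style channel: the source drives VALID, the sink drives READY.
        A beat transfers only on a cycle where both are high."""
        def __init__(self, sink_capacity):
            self.fifo = deque()
            self.sink_capacity = sink_capacity  # beats the sink can buffer

        def ready(self):
            # Back-pressure: READY deasserts when the sink buffer is full.
            return len(self.fifo) < self.sink_capacity

        def cycle(self, valid, data):
            """One clock edge; returns True if the beat transferred."""
            if valid and self.ready():
                self.fifo.append(data)
                return True
            return False  # source must hold VALID and data stable and retry

    ch = Channel(sink_capacity=2)
    for beat in range(4):
        ok = ch.cycle(valid=True, data=beat)
        print(f"beat {beat}: {'accepted' if ok else 'stalled (back-pressure)'}")
    ```
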
  • Do transaction sources really need identifiers? AxID, BID, or RID

    • Yes. The identifiers enable the interconnect to route transactions appropriately, enable masters to keep track of multiple outstanding reads or writes, etc.
      • Regarding IDs, can you provide more details on interconnect routing? I've built an interconnect, and didn't use them. Now, looking back, I can only see potential bugs that would show up if I did. Assuming a single ID, suppose master A makes a request of slave A. Then, before slave A replies, master A makes a request of slave B. Slave B's response is ready before slave A's, but now the interconnect needs to force slave B to wait until slave A is ready? The easy way around this would be to enforce a rule that says a master can only ever have one burst outstanding at a time, or perhaps can only ever talk to one slave with one ID (painful logic implementation) ... It just seems like it'd be simpler to build the interconnect without this hassle.
        • When multiple masters are connected to an interconnect, the ID field is usually extended so responses can be returned to the correct master. Also, the interconnect needs logic to prevent reordering for the same ID. The stupid way to do this is to limit to a single in-flight operation. The better way to do it is to keep track of outstanding operation counts per ID and prevent the same ID from the same master from being used on more than one slave at the same time (this is how the Xilinx crossbar interconnect works).
          • In the interconnect you can append some ID bits to identify the master in the AR channel, and then use those bits to route the R channel back to the appropriate master, so you don't need to have any logic between those channels in the interconnect.
            • This is a good point, and worth discussing, especially since this is the stated purpose of the various ID bits. That said, have you thought through how this would need to be implemented? Consider the following scenario:

              1. Master A, with some ID, issues a request to read from slave A. Let's say it's a burst request for 4 elements.
              2. This request gets assigned an ID; we'll call it AA. It then gets routed to slave A.
              3. Let's allow that slave A is busy, so the burst doesn't get processed immediately.
              4. Master A then issues a second request, using the same ID, but let's say this time it's a request to read 256 elements from slave B. The interconnect then assigns an ID to this request; we can call this new ID AB.
              5. Slave B isn't busy, so it processes the request immediately. It sends its response back.
              6. The interconnect now routes ID AB back to master A, which now receives 256 elements of a burst when it's still expecting a read return of 4 elements.

              Sure, this is easy to fix with enough logic, but how much logic would it take to fix this? The interconnect would need to map each of master A's potential IDs to slaves. This requires a minimum of two burst counters, one for reads and one for writes, for every possible ID. The interconnect would then be required to stall any request from master A with a given ID if (1) it were being sent to a different slave while (2) requests to the first slave remained outstanding. So, yes, it could be done ... but is the extra complexity worth the gain? Indeed, is there a gain to be had at all, and how significant is that gain?

              • The Xilinx Crossbar core addresses this issue through a method they call Single Slave per ID (page 78). In your example, Master A's second request would be stalled until the first request completes. (A code sketch of this rule follows at the end of this question.)
                • This is also mentioned by ARM in their interconnect implementations. Reference
              • So if the master issues two reads with the same ID to two different slaves, generally the interconnect will stall the second operation until the first one completes. It's probably possible to do better than this, but it would require more logic, and would result in blocking somewhere else (i.e. blocking the second read response until the first one completes). Is it worth it? Depends. Like a lot of things, there are trade-offs. I think the assumption of AXI is that the master will issue operations with different IDs so the interconnect can reorder them at will. Also, you don't need counters for all possible IDs; you can use a limited set of counters and allocate and address them on the fly, CAM-style.
                • This is a good point, and I thank you for bringing it up. So, basically you could do an ID reassignment and then perhaps keep only 2-4 active IDs and burst transaction counters for those. If a request for another ID came in while all of those were busy, you'd then wait for an ID to be available to be re-allocated to map to this one. I just cringe at all the extra logic it would take to implement this.
                  • The logic is actually not all that complex.
      • When I built my own interconnects, I found that I could do just fine without IDs. When I later considered using them to route returns back to their master, the logic required to keep them in order even when the master is accessing multiple slaves appeared to not be worth the effort. So... why do we need them?
      • Use them to keep a strict order. Use them as a foundation for a coherency ruleset. Use them to give an implied priority. Interconnects have a particularly tough job with managing outstanding transactions. It gets complex very fast. Even ARM's NIC-400 has very strict limitations on the number of outstanding transactions. I think this would be a problem regardless of the presence of IDs.
      • Slave devices absolutely must implement the ID signals properly. Xilinx's demo AXI slave design doesn't.
      • Not necessarily! On master interfaces, they are all optional, because many masters don't need to make use of this capability. It especially makes sense for interconnect blocks with multiple master interfaces: The interconnect block needs to assign an ID to each transaction to be able to tell which transaction belongs to which master. For this to work, of course, the ID signals are required on slave interfaces. To make it easier on yourself, you can design the slave to simply work with a single ID, for which you only need a single register where you can store the ID until the transaction is over.
      • The IDs are also needed for things that are not related to interconnects: You can hide read latency with multiple outstanding requests. You can take advantage of slave features like command reordering with DDR.
        • Reordering can be totally worth it; it depends a little on your use case and addressing pattern, but if you can avoid one activate-precharge sequence by reordering commands, you can save up to 50 DRAM cycles. It increases your throughput drastically. In general, the latency of an SDRAM is quite bad due to its architecture, and I think most of the time SDRAM cores are tuned for throughput. (In all applications where I have used SDRAM, latency wasn't a factor, only throughput.)
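
    To make the Single Slave per ID rule above concrete, here is a rough Python sketch of the bookkeeping involved (modeled loosely on the behavior described for the Xilinx crossbar; all names are invented for illustration):

    ```python
    class IdTracker:
        """Per-ID rule: an ID may only have outstanding transactions to one
        slave at a time, so responses for that ID can never be reordered."""
        def __init__(self):
            self.target = {}       # ID -> slave currently holding its transactions
            self.outstanding = {}  # ID -> number of transactions in flight

        def can_issue(self, txn_id, slave):
            # Stall if this ID is already in flight to a *different* slave.
            return self.target.get(txn_id, slave) == slave

        def issue(self, txn_id, slave):
            assert self.can_issue(txn_id, slave)
            self.target[txn_id] = slave
            self.outstanding[txn_id] = self.outstanding.get(txn_id, 0) + 1

        def complete(self, txn_id):
            self.outstanding[txn_id] -= 1
            if self.outstanding[txn_id] == 0:
                del self.outstanding[txn_id], self.target[txn_id]

    t = IdTracker()
    t.issue(0, "slave_A")             # master A, ID 0 -> slave A
    print(t.can_issue(0, "slave_B"))  # False: the second request must stall
    t.complete(0)
    print(t.can_issue(0, "slave_B"))  # True: the ID is now free to retarget
    ```
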
  • I'm unaware of any slaves that reorder their returns. Is this really a useful capability?

    • They can. For instance, an AXI slave to PCIe bus master module that converts AXI operations to PCIe operations. PCIe read completions can come back in strange orders. Additionally, multiple requests made through an interconnect to multiple slaves that have different latencies will result in reordering.
  • Slaves need to synchronize the AW channel with the W channel in order to perform any writes, so do we really need two separate channels?

    • This one is somewhat debatable, but one cycle of AW can result in many cycles on W, so splitting them makes sense. It makes storing the write data in a FIFO more efficient as the address can be stored in a shallower FIFO or in a simpler register without significantly degrading throughput.
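
    A toy sketch of that asymmetry, assuming byte-wide data and invented FIFO names: note how a single AW entry covers many W entries, which is why the address FIFO can be much shallower than the data FIFO.

    ```python
    from collections import deque

    aw_fifo = deque()  # shallow in practice: one entry per burst
    w_fifo = deque()   # deep in practice: one entry per data beat

    def push_write(addr, beats):
        aw_fifo.append((addr, len(beats)))  # one AW beat describes the burst
        w_fifo.extend(beats)                # many W beats carry the data

    def commit_one_burst(memory):
        # Perform the write only once the address AND all of its data are here.
        if aw_fifo and len(w_fifo) >= aw_fifo[0][1]:
            addr, n = aw_fifo.popleft()
            for i in range(n):
                memory[addr + i] = w_fifo.popleft()

    mem = {}
    push_write(0x100, [10, 20, 30, 40])  # 1 AW entry vs. 4 W entries
    commit_one_burst(mem)
    print(mem)  # {256: 10, 257: 20, 258: 30, 259: 40}
    ```
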
  • Many IP slaves I've examined arbitrate reads and writes into a single channel. Why maintain both?

    • Because there are slaves that don't do this, and splitting the channels means you can get a significant increase in performance when reads don't block writes and vice versa.
    • I think the split is certainly worth the cost. The data path is already split, and the data path can be far wider than the address path. The design I wrote my AXI library for had a 256 or 512 bit data path, so the overhead for a few extra address lines wasn't much. Also, it makes it very easy to split the read and write connections across separate read only and write only interfaces without requiring any extra arbitration or filtering logic. This is especially useful for DMA logic where the read and write paths can be completely separate. It also means you can build AXI RAMs that use both ports of block RAMs to eliminate contention between reads and writes and get the best possible throughput.
      • Absolutely! However, what eats me up is when you pay all this extra price to get two separate channels to memory, one read and one write, and then the memory interface arbitrates between the two halves (Xilinx's block RAM controller) so that you can only ever read or write to the memory never both. This leaves me wondering why pay the cost when you aren't going to use it?
        • Does the Xilinx block RAM controller really arbitrate? That's just silly. It's not that hard to split it: Link. The master cannot receive two blocks of read data at the same time as it only has one R channel interface, hence the interconnect has to stall the other read response until the first one completes.
    • Isolated read and write channels are a must in a full-duplex system. Sure, you could further isolate the paths by using a unidirectional protocol, one for each direction. But for a full-duplex system, you can't use a standard where, at any point, read and write paths share some kind of resource. Not if you want maximum bandwidth. Having independent channels doesn't guarantee that the two directions don't share a critical path or resource, but it gives the designer the ability to isolate them.
    • For such simple slaves there is also APB, which is more like a "classic" single-channel bus. Looking at block diagrams of typical ARM-based ASIC designs, APB is often used for all the "low-speed" peripherals. Indeed, AXI4-Lite is a bit awkward, because it is "AXI4 with all its features (like bursts and multiple outstanding transactions) disabled". In my experience, the main factor in the added area of an AXI4-Lite slave vs. e.g. Wishbone comes from the RREADY signal: because the master can decide to be not ready to receive data, you always need an additional 32-bit register to buffer the read response from the slave. With Wishbone, the master is always ready "by design". On the other hand, the additional logic of an AXI4-Lite slave compared to Wishbone is on the order of 10 slices, not something which really matters much on e.g. an Artix or Zynq device.
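
    A sketch of that RREADY point, i.e. why an AXI4-Lite slave ends up with an extra read-data holding register (class and method names are invented for illustration): data produced while the master deasserts RREADY has to be parked somewhere.

    ```python
    class LiteReadPath:
        """Read data produced while the master is not ready must be parked."""
        def __init__(self):
            self.rdata_reg = None  # the extra 32-bit register mentioned above

        def cycle(self, new_data, rready):
            """new_data: value the slave logic produced this cycle (or None)."""
            if new_data is not None:
                assert self.rdata_reg is None, "must stall the source when full"
                self.rdata_reg = new_data  # park it until the master is ready
            if rready and self.rdata_reg is not None:
                out, self.rdata_reg = self.rdata_reg, None
                return out                 # beat transfers: RVALID & RREADY high
            return None

    p = LiteReadPath()
    p.cycle(0xABCD, rready=False)           # master not ready: data is parked
    print(hex(p.cycle(None, rready=True)))  # 0xabcd transfers a cycle later
    ```
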
  • Burst protocols require counters, and complex addressing requires next-address logic in both slave and master. Why not just transmit the address together with the request like AXI-lite would do?

    • Knowing the burst size in advance enables better reasoning about the transfer. It also means that cycles required for arbitration don't necessarily impact the throughput, presuming the burst size is large enough.
      • Knowing burst size in advance can help ... how? And once you've paid the latency of arbitration in the interconnect, why pay it again for the next burst? You can achieve interconnect performance with full throughput (1 beat/clock across bursts). You don't need the burst length to do this. Using the burst length just slows the non-burst transactions.
        • For the burst length, it's needed for reads anyway; using the same format for writes keeps things consistent. It can also be used to help manage buffer space in caches and FIFOs. As far as using the burst length for hiding the arbitration latency, it's possible that the majority of operations will be burst operations, and you might have to pay the latency penalty on every transfer if they are going to different slaves.
      • I prefer AXI lite for most situations. Just send the address with the data, and it can still support one data transfer per cycle. Back pressure is needed, and I'm ok with some slaves supporting simultaneous reads and writes.
        • My frustration with AXI-lite has been in the number of AXI implementations that cripple it. For example, if a well-built slave takes N+L cycles to reply to a burst of N beats, with L clocks of latency, then an implementation that refuses to pipeline AXI-lite requests is crippled to a bus efficiency of 1+L clocks for every single-beat transaction. It doesn't have to be this way, but much of the IP I've examined sadly does this.
    • For "bursty" SDRAM access like cache line refills in many cases you need two counters anyway (e.g. one in the DRAM controller to present the column addresses and one in the cache controller do address the cache RAM). Transferring consecutive addresses over the bus/interconnect cost additional energy. This is not an issue on FPGAs where you don't use power-gating, but I can imagine that it matters in ASICs. In addition it will also make buffering interconnects more complex, you need to buffer all the addresses, which for sure consumes more area than a simple counter on master and slave.
  • Whether or not something is cacheable is really determined by the interconnect, not the bus master. Why have an AxCACHE line?

    • Not in an ARM device. On a Cortex-M device with an MPU, the core can configure cacheability on a region-by-region basis.
    • The master needs to be able to force certain operations to not be cached or to be cached in certain ways. Those signals control how the operation is cached. Obviously, if there are no caches, the signals don't really serve a purpose. But providing them means that caching can be controlled in a standardized way.
    • If you implement something like atomic memory operations (e.g. the RISC-V A-extension), you need control over the way data is cached. Another use case is bus masters doing block transfers; you can improve performance when you know you are writing a full cache line. The cache control signals are intended for communication between the bus master and system-side caches.
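
    For reference, a quick decoder for the four AxCACHE bits (using the AXI3 names; AXI4 renames bits 1-3 to Modifiable/Allocate, with meanings that depend on the transaction direction):

    ```python
    def decode_axcache(axcache):
        return {
            "bufferable":     bool(axcache & 0b0001),  # bit 0
            "cacheable":      bool(axcache & 0b0010),  # bit 1 (AXI4: "modifiable")
            "read_allocate":  bool(axcache & 0b0100),  # bit 2
            "write_allocate": bool(axcache & 0b1000),  # bit 3
        }

    # 0b0000 is a device access that must never be cached or buffered, the
    # kind of forced-uncached operation discussed above.
    print(decode_axcache(0b0000))
    print(decode_axcache(0b1111))  # cacheable, allocate on read and write
    ```
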
  • I can understand having the privileged vs. unprivileged, or instruction vs. data flags of AxPROT, but why the secure vs. non-secure flag? It seems to me that either the whole system should be "secure" or not secure, and that it shouldn't be an option of a particular transaction.

    • Because an ARM core can be running in a secure or a non-secure context, and some devices may want to limit access from non-secure contexts.
    • In general, the interconnect will reject the access (and report this in BRESP or RRESP) instead of the slave. But either is possible.
    • Secure is essentially a privilege level higher than privileged. It is used for ARM trust zone, etc. for implementing things that even the OS cannot touch.
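
    For reference, the three AxPROT bits decode as follows:

    ```python
    def decode_axprot(axprot):
        return {
            "privileged":  bool(axprot & 0b001),  # bit 0: privileged access
            "non_secure":  bool(axprot & 0b010),  # bit 1: 1 = non-secure world
            "instruction": bool(axprot & 0b100),  # bit 2: instruction, not data
        }

    # A TrustZone-aware slave (or the interconnect) can reject a transaction
    # whose non-secure bit is set, reporting an error on BRESP or RRESP.
    print(decode_axprot(0b010))  # unprivileged, non-secure, data access
    ```
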
  • In the case of arbitrating among many masters, you need to pick which masters are asking for which slaves by address. To sort by QoS request requires more logic and hence more clocks. In other words, we slowed things down in order to speed them up. Is this really required?

    • The QoS lines are present so that there is a standardized way of controlling the interconnect. The interconnect is not required to use those signals.
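
    One possible (purely illustrative) use of the signal: a toy arbiter that, among masters with a pending request, grants the one with the highest AxQOS value. The spec only defines the signal; an interconnect is free to ignore it.

    ```python
    def grant(requests):
        """requests: list of (master_index, axqos) pairs for pending requests.
        Returns the granted master's index; ties go to the lowest index."""
        if not requests:
            return None
        return max(requests, key=lambda r: (r[1], -r[0]))[0]

    print(grant([(0, 2), (1, 7), (2, 7)]))  # 1: highest QoS wins, tie to index 1
    ```
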
  • Are the above concerns equally valid for ASIC & FPGA?

    • For smaller FPGA designs, switching from e.g. Wishbone to AXI4 does not gain any value, besides being compatible with standard vendor IP like Xilinx MIG or other IP cores. If you have a single-core in-order processor as the primary bus master in your design, there is no benefit from any form of out-of-order transaction processing in your periphery. Typical FPGA designs have a single SDRAM chip, so there is also no benefit from reordering of DRAM accesses. AXI4 is designed by ARM for their "A"-type multi-cores. Modern out-of-order cores can have ~200 machine instructions "in flight". Also, in ASICs, area is less of a concern than power. As long as you can clock-gate or power-gate it, additional logic is not a real problem.
  • Could AXI leak data, or be exploitable, if you allow malicious code to run in a VM or on some other core on the same device? How prepared is it to deal with Spectre?

    • Any interface could possibly expose timing side-channels. This is certainly not limited to AXI. But many of the vulnerabilities you refer to (spectre, et al.) have more to do with the CPU architecture itself - and which operations it initiates and under what circumstances - and have little to do with the interfaces over which those operations are carried out.
  • Can the terms "AXI" and "bus" be used interchangeably?

    • By definition of terms, a "bus" is an interface where you can have more than one device on the same physical connections (e.g., sharing the same wires), and you get into issues with things like bus mastering, contention, etc. Think about I2C in this case. AXI avoids things like that because it is purely a master/slave point-to-point interface. Even AXI crossbars like Xilinx implements are still just point-to-point with a little bit of man-in-the-middle translation going on.

    In AXI, the MASTER always starts the transaction by kicking off the infamous READY/VALID handshaking (see here for more info -- https://vhdlwhiz.com/axi-fifo/). The SLAVE cannot do this. In a true bus, anyone can kick off a data transaction to anyone else by knowing the correct address.

    In terms of the transaction types: I bring that up because, for a bus like USB, isochronous transfers are detrimental to other devices on the bus. For example, if I hook up a USB microphone or camera that is isochronous, this will have a negative impact on (for example) data transfers from a USB stick or keyboard, because the bus is designed to prioritize the isochronous data. This is a non-issue in AXI (except perhaps when you are using a crossbar, as noted above) because the transaction is point-to-point.

  • Why do we have the 4KB boundary restriction?

    • There are two reasons for the 4KB boundary restriction. First, interconnect addressing granularity is also 4KB, so the interconnect does not have to deal with splitting bursts across multiple slaves. The second reason has to do with the MMU. This is intended to prevent operations from crossing page boundaries, as the MMU will translate virtual addresses to physical addresses on a page-by-page basis, where a page is commonly 4KB. PCIe has the same restriction. Yes, it is a bit annoying to enforce this, but it is necessary to prevent bursts from accessing multiple slaves. The timing penalty associated with splitting transfers at the burst length or at 4KB boundaries, times two for PCIe and AXI, is annoying. Reference
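
    The rule itself is cheap to implement. A sketch of how a master or bridge might split a transfer so that no burst crosses a 4KB page (this ignores the separate AxLEN maximum-burst-length limit):

    ```python
    def split_at_4k(addr, total_bytes):
        """Yield (address, length) pieces, none of which crosses a 4KB page."""
        while total_bytes > 0:
            room = 0x1000 - (addr & 0xFFF)  # bytes left in the current 4KB page
            chunk = min(total_bytes, room)
            yield addr, chunk
            addr += chunk
            total_bytes -= chunk

    for a, n in split_at_4k(0x0FF0, 64):
        print(hex(a), n)
    # 0xff0 16
    # 0x1000 48   -- split exactly at the page boundary
    ```
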
  • To me, a "high performance protocol" is one that allows one beat of information to be communicated on every clock. Many if not most of the AXI implementations I've seen don't actually hit this target simply because all of the extra logic required to implement the bus slows it down. There is also concerns of lost throughput/latency

    • I don't get where your assumption comes from that you cannot transfer data every cycle. On the write channel, you can assert the control and data signals in the same cycle (and more data with a burst) and you get 100% throughput (assuming the slave is always ready; if not, it's not the protocol's fault). On the read channel, you can send reads back-to-back to hide the latency (assuming the slave can handle multiple reads), or if the latency is zero, the slave can assert the data signals every cycle (assuming the master is always ready to receive; if not, it's not the protocol's fault) and you once again get 100% throughput. One can argue that the protocol has a lot of signals and thus quite some overhead, but either you need those extra signals for performance, or they are static and your tool of choice can synthesise them away. The same reasoning applies to the split read and write channels: if you have independent resources for read and write (e.g. IOs, transceivers, FIFOs, etc.), you can achieve 100% throughput in both directions, and if you have just one resource, either use it in one direction or arbitrate between read and write. In both cases you can easily scale to your application's needs. Note: for simple peripheral register interfaces (non-burst), always use AXI-lite.
    • It's less about latency and more about bandwidth. AXI is designed to move around large blocks of data, such as full cache lines at once. Single word operations are not the priority - it is expected that most of those will be satisfied by the CPU instruction and data caches directly - and it may not be possible to saturate an AXI interface with single word operations. Same goes for memory controllers. Running at a higher clock speed and keeping the interface busy is likely more important than getting the minimum possible latency for most applications - after all, the system CPU could be running some other thread while waiting for the read data to show up in the cache.
    • I think AXI4 is designed for situations where the core logic is much faster than e.g. the memory. In FPGAs the situation is the other way around; that is the reason why you need a 128-bit AXI4 bus to match the data rate of a 16-bit DDR-RAM chip. On a "real" CPU, refilling a cache line from DRAM will cost you 200 or more clock cycles. It doesn't matter when your bus protocol adds 10 cycles on top. But you don't want your interconnect to be blocked while waiting for this incredibly slow memory system.
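
    The latency-hiding argument above is easy to put numbers on. A crude, invented cost model: N single-beat reads against a slave with L cycles of latency, with and without multiple reads in flight.

    ```python
    def cycles(n_beats, latency, pipelined):
        """Crude cost model for n_beats single-beat reads."""
        if pipelined:
            return n_beats + latency    # latency paid once, then 1 beat/clock
        return n_beats * (1 + latency)  # latency paid again on every beat

    for n in (1, 16, 256):
        print(n, cycles(n, 10, pipelined=False), cycles(n, 10, pipelined=True))
    # 1 11 11 / 16 176 26 / 256 2816 266
    ```
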