nlitsme/riscv.md

## riscv.md

      
Volume I: RISC-V Unprivileged ISA V20191214-draft

Preface

This document describes the RISC-V unprivileged architecture.
The ISA modules marked Ratified have been ratified at this time. The
modules marked Frozen are not expected to change significantly before
being put up for ratification. The modules marked Draft are expected
to change before ratification.
The document contains the following versions of the RISC-V ISA modules:


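| Base   | Version | Status   |
|--------|---------|----------|
| RVWMO  | 2.0     | Ratified |
| RV32I  | 2.1     | Ratified |
| RV64I  | 2.1     | Ratified |
| RV32E  | 1.9     | Draft    |
| RV128I | 1.7     | Draft    |

| Extension   | Version | Status   |
|-------------|---------|----------|
| M           | 2.0     | Ratified |
| A           | 2.1     | Ratified |
| F           | 2.2     | Ratified |
| D           | 2.2     | Ratified |
| Q           | 2.2     | Ratified |
| C           | 2.0     | Ratified |
| Counters    | 2.0     | Draft    |
| L           | 0.0     | Draft    |
| B           | 0.0     | Draft    |
| J           | 0.0     | Draft    |
| T           | 0.0     | Draft    |
| P           | 0.2     | Draft    |
| V           | 0.7     | Draft    |
| Zicsr       | 2.0     | Ratified |
| Zifencei    | 2.0     | Ratified |
| Zihintpause | 2.0     | Ratified |
| Zihintntl   | 0.2     | Draft    |
| Zam         | 0.1     | Draft    |
| Zfh         | 1.0     | Ratified |
| Zfhmin      | 1.0     | Ratified |
| Zfinx       | 1.0     | Ratified |
| Zdinx       | 1.0     | Ratified |
| Zhinx       | 1.0     | Ratified |
| Zhinxmin    | 1.0     | Ratified |
| Zmmul       | 1.0     | Ratified |
| Ztso        | 0.1     | Frozen   |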


Preface to Document Version 20191213-Base-Ratified

This document describes the RISC-V unprivileged architecture.
The ISA modules marked Ratified have been ratified at this time. The
modules marked Frozen are not expected to change significantly before
being put up for ratification. The modules marked Draft are expected
to change before ratification.
The document contains the following versions of the RISC-V ISA modules:


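| Base   | Version | Status   |
|--------|---------|----------|
| RVWMO  | 2.0     | Ratified |
| RV32I  | 2.1     | Ratified |
| RV64I  | 2.1     | Ratified |
| RV32E  | 1.9     | Draft    |
| RV128I | 1.7     | Draft    |

| Extension | Version | Status   |
|-----------|---------|----------|
| M         | 2.0     | Ratified |
| A         | 2.1     | Ratified |
| F         | 2.2     | Ratified |
| D         | 2.2     | Ratified |
| Q         | 2.2     | Ratified |
| C         | 2.0     | Ratified |
| Counters  | 2.0     | Draft    |
| L         | 0.0     | Draft    |
| B         | 0.0     | Draft    |
| J         | 0.0     | Draft    |
| T         | 0.0     | Draft    |
| P         | 0.2     | Draft    |
| V         | 0.7     | Draft    |
| Zicsr     | 2.0     | Ratified |
| Zifencei  | 2.0     | Ratified |
| Zam       | 0.1     | Draft    |
| Ztso      | 0.1     | Frozen   |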


The changes in this version of the document include:


The A extension, now version 2.1, was ratified by the board in
December 2019.


Defined big-endian ISA variant.


Moved N extension for user-mode interrupts into Volume II.


Defined PAUSE hint instruction.


Preface to Document Version 20190608-Base-Ratified

This document describes the RISC-V unprivileged architecture.
The RVWMO memory model has been ratified at this time. The ISA modules
marked Ratified have been ratified at this time. The modules marked
Frozen are not expected to change significantly before being put up
for ratification. The modules marked Draft are expected to change
before ratification.
The document contains the following versions of the RISC-V ISA modules:


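| Base   | Version | Status   |
|--------|---------|----------|
| RVWMO  | 2.0     | Ratified |
| RV32I  | 2.1     | Ratified |
| RV64I  | 2.1     | Ratified |
| RV32E  | 1.9     | Draft    |
| RV128I | 1.7     | Draft    |

| Extension | Version | Status   |
|-----------|---------|----------|
| Zifencei  | 2.0     | Ratified |
| Zicsr     | 2.0     | Ratified |
| M         | 2.0     | Ratified |
| A         | 2.0     | Frozen   |
| F         | 2.2     | Ratified |
| D         | 2.2     | Ratified |
| Q         | 2.2     | Ratified |
| C         | 2.0     | Ratified |
| Ztso      | 0.1     | Frozen   |
| Counters  | 2.0     | Draft    |
| L         | 0.0     | Draft    |
| B         | 0.0     | Draft    |
| J         | 0.0     | Draft    |
| T         | 0.0     | Draft    |
| P         | 0.2     | Draft    |
| V         | 0.7     | Draft    |
| N         | 1.1     | Draft    |
| Zam       | 0.1     | Draft    |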


The changes in this version of the document include:


Moved description to Ratified for the ISA modules ratified by
the board in early 2019.


Removed the A extension from ratification.


Changed document version scheme to avoid confusion with versions of
the ISA modules.


Incremented the version numbers of the base integer ISA to 2.1,
reflecting the presence of the ratified RVWMO memory model and
exclusion of FENCE.I, counters, and CSR instructions that were in
previous base ISA.


Incremented the version numbers of the F and D extensions to 2.2,
reflecting that version 2.1 changed the canonical NaN, and version
2.2 defined the NaN-boxing scheme and changed the definition of the
FMIN and FMAX instructions.


Changed name of document to refer to “unprivileged” instructions as
part of move to separate ISA specifications from platform profile
mandates.


Added clearer and more precise definitions of execution
environments, harts, traps, and memory accesses.


Defined instruction-set categories: standard, reserved,
custom, non-standard, and non-conforming.


Removed text implying operation under alternate endianness, as
alternate-endianness operation has not yet been defined for RISC-V.


Changed description of misaligned load and store behavior. The
specification now allows visible misaligned address traps in
execution environment interfaces, rather than just mandating
invisible handling of misaligned loads and stores in user mode.
Also, now allows access-fault exceptions to be reported for
misaligned accesses (including atomics) that should not be emulated.


Moved FENCE.I out of the mandatory base and into a separate
extension, with Zifencei ISA name. FENCE.I was removed from the
Linux user ABI and is problematic in implementations with large
incoherent instruction and data caches. However, it remains the only
standard instruction-fetch coherence mechanism.


Removed prohibitions on using RV32E with other extensions.


Removed platform-specific mandates that certain encodings produce
illegal instruction exceptions in RV32E and RV64I chapters.


Counter/timer instructions are now not considered part of the
mandatory base ISA, and so CSR instructions were moved into separate
chapter and marked as version 2.0, with the unprivileged counters
moved into another separate chapter. The counters are not ready for
ratification as there are outstanding issues, including counter
inaccuracies.


A CSR-access ordering model has been added.


Explicitly defined the 16-bit half-precision floating-point format
for floating-point instructions in the 2-bit fmt field.


Defined the signed-zero behavior of FMIN.fmt and FMAX.fmt, and
changed their behavior on signaling-NaN inputs to conform to the
minimumNumber and maximumNumber operations in the proposed IEEE
754-201x specification.


The memory consistency model, RVWMO, has been defined.


The “Zam” extension, which permits misaligned AMOs and specifies
their semantics, has been defined.


The “Ztso” extension, which enforces a stricter memory consistency
model than RVWMO, has been defined.


Improvements to the description and commentary.


Defined the term IALIGN as shorthand to describe the
instruction-address alignment constraint.


Removed text of P extension chapter as now superseded by active task
group documents.


Removed text of V extension chapter as now superseded by separate
vector extension draft document.


Preface to Document Version 2.2

This is version 2.2 of the document describing the RISC-V user-level
architecture. The document contains the following versions of the RISC-V
ISA modules:


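| Base   | Version | Draft Frozen? |
|--------|---------|---------------|
| RV32I  | 2.0     | Y             |
| RV32E  | 1.9     | N             |
| RV64I  | 2.0     | Y             |
| RV128I | 1.7     | N             |

| Extension | Version | Frozen? |
|-----------|---------|---------|
| M         | 2.0     | Y       |
| A         | 2.0     | Y       |
| F         | 2.0     | Y       |
| D         | 2.0     | Y       |
| Q         | 2.0     | Y       |
| L         | 0.0     | N       |
| C         | 2.0     | Y       |
| B         | 0.0     | N       |
| J         | 0.0     | N       |
| T         | 0.0     | N       |
| P         | 0.1     | N       |
| V         | 0.7     | N       |
| N         | 1.1     | N       |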


To date, no parts of the standard have been officially ratified by the
RISC-V Foundation, but the components labeled “frozen” above are not
expected to change during the ratification process beyond resolving
ambiguities and holes in the specification.
The major changes in this version of the document include:


The previous version of this document was released under a Creative
Commons Attribution 4.0 International License by the original
authors, and this and future versions of this document will be
released under the same license.


Rearranged chapters to put all extensions first in canonical order.


Improvements to the description and commentary.


Modified implicit hinting suggestion on JALR to support more
efficient macro-op fusion of LUI/JALR and AUIPC/JALR pairs.


Clarification of constraints on load-reserved/store-conditional
sequences.


A new table of control and status register (CSR) mappings.


Clarified purpose and behavior of high-order bits of fcsr.


Corrected the description of the FNMADD.fmt and FNMSUB.fmt
instructions, which had suggested the incorrect sign of a zero
result.


Instructions FMV.S.X and FMV.X.S were renamed to FMV.W.X and FMV.X.W
respectively to be more consistent with their semantics, which did
not change. The old names will continue to be supported in the
tools.


Specified behavior of narrower (<FLEN) floating-point values held
in wider f registers using NaN-boxing model.


Defined the exception behavior of FMA(∞, 0, qNaN).


Added note indicating that the P extension might be reworked into an
integer packed-SIMD proposal for fixed-point operations using the
integer registers.


A draft proposal of the V vector instruction-set extension.


An early draft proposal of the N user-level traps extension.


An expanded pseudoinstruction listing.


Removal of the calling convention chapter, which has been superseded
by the RISC-V ELF psABI Specification.


The C extension has been frozen and renumbered version 2.0.


Preface to Document Version 2.1

This is version 2.1 of the document describing the RISC-V user-level
architecture. Note the frozen user-level ISA base and extensions IMAFDQ
version 2.0 have not changed from the previous version of this
document, but some specification holes have been fixed and the
documentation has been improved. Some changes have been made to the
software conventions.


Numerous additions and improvements to the commentary sections.


Separate version numbers for each chapter.


Modification to long instruction encodings >64 bits to
avoid moving the rd specifier in very long instruction formats.


CSR instructions are now described in the base integer format where
the counter registers are introduced, as opposed to only being
introduced later in the floating-point section (and the companion
privileged architecture manual).


The SCALL and SBREAK instructions have been renamed to ECALL and
EBREAK, respectively. Their encoding and functionality are
unchanged.


Clarification of floating-point NaN handling, and a new canonical
NaN value.


Clarification of values returned by floating-point to integer
conversions that overflow.


Clarification of LR/SC allowed successes and required failures,
including use of compressed instructions in the sequence.


A new RV32E base ISA proposal for reduced integer register counts,
which supports the MAC extensions.


A revised calling convention.


Relaxed stack alignment for soft-float calling convention, and
description of the RV32E calling convention.


A revised proposal for the C compressed extension, version 1.9.


Preface to Version 2.0

This is the second release of the user ISA specification, and we intend
the specification of the base user ISA plus general extensions (i.e.,
IMAFD) to remain fixed for future development. The following changes
have been made since Version 1.0 of this ISA specification.


The ISA has been divided into an integer base with several standard
extensions.


The instruction formats have been rearranged to make immediate
encoding more efficient.


The base ISA has been defined to have a little-endian memory system,
with big-endian or bi-endian as non-standard variants.


Load-Reserved/Store-Conditional (LR/SC) instructions have been added
in the atomic instruction extension.


AMOs and LR/SC can support the release consistency model.


The FENCE instruction provides finer-grain memory and I/O orderings.


An AMO for fetch-and-XOR (AMOXOR) has been added, and the encoding
for AMOSWAP has been changed to make room.


The AUIPC instruction, which adds a 20-bit upper immediate to the
pc, replaces the RDNPC instruction, which only read the current
pc value. This results in significant savings for
position-independent code.


The JAL instruction has now moved to the U-Type format with an
explicit destination register, and the J instruction has been
dropped, being replaced by JAL with rd=x0. This removes the only
instruction with an implicit destination register and removes the
J-Type instruction format from the base ISA. There is an
accompanying reduction in JAL reach, but a significant reduction in
base ISA complexity.


The static hints on the JALR instruction have been dropped. The
hints are redundant with the rd and rs1 register specifiers for
code compliant with the standard calling convention.


The JALR instruction now clears the lowest bit of the calculated
target address, to simplify hardware and to allow auxiliary
information to be stored in function pointers.


The MFTX.S and MFTX.D instructions have been renamed to FMV.X.S and
FMV.X.D, respectively. Similarly, MXTF.S and MXTF.D instructions
have been renamed to FMV.S.X and FMV.D.X, respectively.


The MFFSR and MTFSR instructions have been renamed to FRCSR and
FSCSR, respectively. FRRM, FSRM, FRFLAGS, and FSFLAGS instructions
have been added to individually access the rounding mode and
exception flags subfields of the fcsr.


The FMV.X.S and FMV.X.D instructions now source their operands from
rs1, instead of rs2. This change simplifies datapath design.


FCLASS.S and FCLASS.D floating-point classify instructions have been
added.


A simpler NaN generation and propagation scheme has been adopted.


For RV32I, the system performance counters have been extended to
64 bits wide, with separate read access to the upper and lower 32
bits.


Canonical NOP and MV encodings have been defined.


Standard instruction-length encodings have been defined for 48-bit,
64-bit, and >64-bit instructions.


Description of a 128-bit address space variant, RV128, has been
added.


Major opcodes in the 32-bit base instruction format have been
allocated for user-defined custom extensions.


A typographical error that suggested that stores source their data
from rd has been corrected to refer to rs2.


Introduction

RISC-V (pronounced “risk-five”) is a new instruction-set architecture
(ISA) that was originally designed to support computer architecture
research and education, but which we now hope will also become a
standard free and open architecture for industry implementations. Our
goals in defining RISC-V include:


A completely open ISA that is freely available to academia and
industry.


A real ISA suitable for direct native hardware implementation, not
just simulation or binary translation.


An ISA that avoids “over-architecting” for a particular
microarchitecture style (e.g., microcoded, in-order, decoupled,
out-of-order) or implementation technology (e.g., full-custom, ASIC,
FPGA), but which allows efficient implementation in any of these.


An ISA separated into a small base integer ISA, usable by itself
as a base for customized accelerators or for educational purposes,
and optional standard extensions, to support general-purpose
software development.


Support for the revised 2008 IEEE-754 floating-point standard.


An ISA supporting extensive ISA extensions and specialized variants.


Both 32-bit and 64-bit address space variants for applications,
operating system kernels, and hardware implementations.


An ISA with support for highly parallel multicore or manycore
implementations, including heterogeneous multiprocessors.


Optional variable-length instructions to both expand available
instruction encoding space and to support an optional dense
instruction encoding for improved performance, static code size,
and energy efficiency.


A fully virtualizable ISA to ease hypervisor development.


An ISA that simplifies experiments with new privileged architecture
designs.


Commentary on our design decisions is formatted as in this paragraph.
This non-normative text can be skipped if the reader is only interested
in the specification itself.


The name RISC-V was chosen to represent the fifth major RISC ISA design
from UC Berkeley (RISC-I, RISC-II, SOAR, and SPUR were the first
four). We also pun on the use of the Roman numeral “V” to signify
“variations” and “vectors”, as support for a range of architecture
research, including various data-parallel accelerators, is an explicit
goal of the ISA design.

The RISC-V ISA is defined avoiding implementation details as much as
possible (although commentary is included on implementation-driven
decisions) and should be read as the software-visible interface to a
wide variety of implementations rather than as the design of a
particular hardware artifact. The RISC-V manual is structured in two
volumes. This volume covers the design of the base unprivileged
instructions, including optional unprivileged ISA extensions.
Unprivileged instructions are those that are generally usable in all
privilege modes in all privileged architectures, though behavior might
vary depending on privilege mode and privilege architecture. The second
volume provides the design of the first (“classic”) privileged
architecture. The manuals use IEC 80000-13:2008 conventions, with a byte
of 8 bits.

In the unprivileged ISA design, we tried to remove any dependence on
particular microarchitectural features, such as cache line size, or on
privileged architecture details, such as page translation. This is both
for simplicity and to allow maximum flexibility for alternative
microarchitectures or alternative privileged architectures.

RISC-V Hardware Platform Terminology

A RISC-V hardware platform can contain one or more RISC-V-compatible
processing cores together with other non-RISC-V-compatible cores,
fixed-function accelerators, various physical memory structures, I/O
devices, and an interconnect structure to allow the components to
communicate.
A component is termed a core if it contains an independent instruction
fetch unit. A RISC-V-compatible core might support multiple
RISC-V-compatible hardware threads, or harts, through multithreading.
A RISC-V core might have additional specialized instruction-set
extensions or an added coprocessor. We use the term coprocessor to
refer to a unit that is attached to a RISC-V core and is mostly
sequenced by a RISC-V instruction stream, but which contains additional
architectural state and instruction-set extensions, and possibly some
limited autonomy relative to the primary RISC-V instruction stream.
We use the term accelerator to refer to either a non-programmable
fixed-function unit or a core that can operate autonomously but is
specialized for certain tasks. In RISC-V systems, we expect many
programmable accelerators will be RISC-V-based cores with specialized
instruction-set extensions and/or customized coprocessors. An important
class of RISC-V accelerators are I/O accelerators, which offload I/O
processing tasks from the main application cores.
The system-level organization of a RISC-V hardware platform can range
from a single-core microcontroller to a many-thousand-node cluster of
shared-memory manycore server nodes. Even small systems-on-a-chip might
be structured as a hierarchy of multicomputers and/or multiprocessors to
modularize development effort or to provide secure isolation between
subsystems.

RISC-V Software Execution Environments and Harts

The behavior of a RISC-V program depends on the execution environment in
which it runs. A RISC-V execution environment interface (EEI) defines
the initial state of the program, the number and type of harts in the
environment including the privilege modes supported by the harts, the
accessibility and attributes of memory and I/O regions, the behavior of
all legal instructions executed on each hart (i.e., the ISA is one
component of the EEI), and the handling of any interrupts or exceptions
raised during execution including environment calls. Examples of EEIs
include the Linux application binary interface (ABI), or the RISC-V
supervisor binary interface (SBI). The implementation of a RISC-V
execution environment can be pure hardware, pure software, or a
combination of hardware and software. For example, opcode traps and
software emulation can be used to implement functionality not provided
in hardware. Examples of execution environment implementations include:


“Bare metal” hardware platforms where harts are directly implemented
by physical processor threads and instructions have full access to
the physical address space. The hardware platform defines an
execution environment that begins at power-on reset.


RISC-V operating systems that provide multiple user-level execution
environments by multiplexing user-level harts onto available
physical processor threads and by controlling access to memory via
virtual memory.


RISC-V hypervisors that provide multiple supervisor-level execution
environments for guest operating systems.


RISC-V emulators, such as Spike, QEMU or rv8, which emulate RISC-V
harts on an underlying x86 system, and which can provide either a
user-level or a supervisor-level execution environment.


A bare hardware platform can be considered to define an EEI, where the
accessible harts, memory, and other devices populate the environment,
and the initial state is that at power-on reset. Generally, most
software is designed to use a more abstract interface to the hardware,
as more abstract EEIs provide greater portability across different
hardware platforms. Often EEIs are layered on top of one another, where
one higher-level EEI uses another lower-level EEI.

From the perspective of software running in a given execution
environment, a hart is a resource that autonomously fetches and executes
RISC-V instructions within that execution environment. In this respect,
a hart behaves like a hardware thread resource even if time-multiplexed
onto real hardware by the execution environment. Some EEIs support the
creation and destruction of additional harts, for example, via
environment calls to fork new harts.
The execution environment is responsible for ensuring the eventual
forward progress of each of its harts. For a given hart, that
responsibility is suspended while the hart is exercising a mechanism
that explicitly waits for an event, such as the wait-for-interrupt
instruction defined in Volume II of this specification; and that
responsibility ends if the hart is terminated. The following events
constitute forward progress:


The retirement of an instruction.


A trap, as defined in
Section 1.6.


Any other event defined by an extension to constitute forward
progress.


The term hart was introduced in the work on Lithe to provide a term to
represent an abstract execution resource as opposed to a software thread
programming abstraction.
The important distinction between a hardware thread (hart) and a
software thread context is that the software running inside an execution
environment is not responsible for causing progress of each of its
harts; that is the responsibility of the outer execution environment. So
the environment’s harts operate like hardware threads from the
perspective of the software inside the execution environment.
An execution environment implementation might time-multiplex a set of
guest harts onto fewer host harts provided by its own execution
environment but must do so in a way that guest harts operate like
independent hardware threads. In particular, if there are more guest
harts than host harts then the execution environment must be able to
preempt the guest harts and must not wait indefinitely for guest
software on a guest hart to “yield” control of the guest hart.
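
The preemption requirement can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the names `guest_hart` and `run`, the fixed time slice, and the round-robin policy are all hypothetical and are not prescribed by the specification. Each guest hart is modeled as an endless instruction stream that never yields voluntarily, so forward progress for every guest depends entirely on the environment preempting it.

```python
from collections import deque
from itertools import count


def guest_hart(hart_id):
    """Endless instruction stream that never yields control voluntarily."""
    for pc in count():
        yield hart_id, pc  # one retired instruction per step


def run(num_guests, num_hosts, time_slice, rounds):
    """Round-robin guest harts onto fewer host harts (hypothetical policy).

    Returns the number of instructions retired by each guest hart.
    """
    ready = deque(guest_hart(i) for i in range(num_guests))
    retired = [0] * num_guests
    for _ in range(rounds):
        # Pick as many guest harts as there are host harts for this round.
        running = [ready.popleft() for _ in range(min(num_hosts, len(ready)))]
        for hart in running:
            for _ in range(time_slice):
                hart_id, _pc = next(hart)
                retired[hart_id] += 1
            # Preempt: the environment stops the guest after its time slice
            # instead of waiting for it to give up the host hart.
            ready.append(hart)
    return retired


# Three guest harts share two host harts; all of them make progress.
print(run(num_guests=3, num_hosts=2, time_slice=4, rounds=3))  # [8, 8, 8]
```

Because the environment, not the guest software, decides when a guest hart stops running, each guest hart behaves like an independent hardware thread from the guest's point of view.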

RISC-V ISA Overview

A RISC-V ISA is defined as a base integer ISA, which must be present in
any implementation, plus optional extensions to the base ISA. The base
integer ISAs are very similar to those of the early RISC processors
except with no branch delay slots and with support for optional
variable-length instruction encodings. A base is carefully restricted to
a minimal set of instructions sufficient to provide a reasonable target
for compilers, assemblers, linkers, and operating systems (with
additional privileged operations), and so provides a convenient ISA and
software toolchain “skeleton” around which more customized processor
ISAs can be built.
Although it is convenient to speak of the RISC-V ISA, RISC-V is
actually a family of related ISAs, of which there are currently four
base ISAs. Each base integer instruction set is characterized by the
width of the integer registers and the corresponding size of the address
space and by the number of integer registers. There are two primary base
integer variants, RV32I and RV64I, described in
Chapters [rv32] and
[rv64], which provide 32-bit or 64-bit address
spaces respectively. We use the term XLEN to refer to the width of an
integer register in bits (either 32 or 64).
Chapter [rv32e] describes the RV32E subset variant of
the RV32I base instruction set, which has been added to support small
microcontrollers, and which has half the number of integer registers.
Chapter [rv128] sketches a future RV128I variant of
the base integer instruction set supporting a flat 128-bit address space
(XLEN=128). The base integer instruction sets use a two’s-complement
representation for signed integer values.

Although 64-bit address spaces are a requirement for larger systems, we
believe 32-bit address spaces will remain adequate for many embedded and
client devices for decades to come and will be desirable to lower memory
traffic and energy consumption. In addition, 32-bit address spaces are
sufficient for educational purposes. A larger flat 128-bit address space
might eventually be required, so we ensured this could be accommodated
within the RISC-V ISA framework.


The four base ISAs in RISC-V are treated as distinct base ISAs. A common
question is why is there not a single ISA, and in particular, why is
RV32I not a strict subset of RV64I? Some earlier ISA designs (SPARC,
MIPS) adopted a strict superset policy when increasing address space
size to support running existing 32-bit binaries on new 64-bit hardware.
The main advantage of explicitly separating base ISAs is that each base
ISA can be optimized for its needs without being required to support all the
operations needed for other base ISAs. For example, RV64I can omit
instructions and CSRs that are only needed to cope with the narrower
registers in RV32I. The RV32I variants can use encoding space otherwise
reserved for instructions only required by wider address-space variants.
The main disadvantage of not treating the design as a single ISA is that
it complicates the hardware needed to emulate one base ISA on another
(e.g., RV32I on RV64I). However, differences in addressing and illegal
instruction traps generally mean some mode switch would be required in
hardware in any case even with full superset instruction encodings, and
the different RISC-V base ISAs are similar enough that supporting
multiple versions is relatively low cost. Although some have proposed
that the strict superset design would allow legacy 32-bit libraries to
be linked with 64-bit code, this is impractical, even with
compatible encodings, due to the differences in software calling
conventions and system-call interfaces.
The RISC-V privileged architecture provides fields in misa to control
the unprivileged ISA at each level to support emulating different base
ISAs on the same hardware. We note that newer SPARC and MIPS ISA
revisions have deprecated support for running 32-bit code unchanged on
64-bit systems.
A related question is why there is a different encoding for 32-bit adds
in RV32I (ADD) and RV64I (ADDW). The ADDW opcode could be used for
32-bit adds in RV32I and ADDD for 64-bit adds in RV64I, instead of the
existing design which uses the same opcode ADD for 32-bit adds in RV32I
and 64-bit adds in RV64I with a different opcode ADDW for 32-bit adds in
RV64I. This would also be more consistent with the use of the same LW
opcode for 32-bit load in both RV32I and RV64I. The very first versions
of RISC-V ISA did have a variant of this alternate design, but the
RISC-V design was changed to the current choice in January 2011. Our
focus was on supporting 32-bit integers in the 64-bit ISA, not on
providing compatibility with the 32-bit ISA, and the motivation was to
remove the asymmetry that arose from not all opcodes in RV32I
having a *W suffix (e.g., ADDW, but AND not ANDW). In hindsight, this was
perhaps not well-justified and a consequence of designing both ISAs at
the same time as opposed to adding one later to sit on top of another,
and also from a belief we had to fold platform requirements into the ISA
spec which would imply that all the RV32I instructions would have been
required in RV64I. It is too late to change the encoding now, but this
is also of little practical consequence for the reasons stated above.
It has been noted we could enable the *W variants as an extension to
RV32I systems to provide a common encoding across RV64I and a future
RV32 variant.
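
To make the ADD/ADDW distinction concrete, the following sketch models the three additions on Python integers. The ADDW behavior shown (add, keep the low 32 bits, sign-extend to 64 bits) follows the RV64I definition rather than anything stated in this passage; the function names are illustrative only.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def sign_extend_32_to_64(value):
    """Sign-extend the low 32 bits of value into a 64-bit register pattern."""
    value &= MASK32
    if value & (1 << 31):
        value |= MASK64 & ~MASK32  # fill bits 63..32 with ones
    return value


def add_rv32(rs1, rs2):
    """ADD on RV32I: 32-bit result, overflow ignored."""
    return (rs1 + rs2) & MASK32


def add_rv64(rs1, rs2):
    """ADD on RV64I: same opcode as RV32I ADD, but a 64-bit result."""
    return (rs1 + rs2) & MASK64


def addw_rv64(rs1, rs2):
    """ADDW on RV64I: 32-bit add whose result is sign-extended to 64 bits."""
    return sign_extend_32_to_64(rs1 + rs2)


a, b = 0x7FFF_FFFF, 1
print(hex(add_rv32(a, b)))   # 0x80000000
print(hex(add_rv64(a, b)))   # 0x80000000
print(hex(addw_rv64(a, b)))  # 0xffffffff80000000
```

The last two lines show why RV64I needs a separate opcode for 32-bit adds: the same inputs give different register contents depending on whether the sum is treated as a 64-bit or a 32-bit value.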

RISC-V has been designed to support extensive customization and
specialization. Each base integer ISA can be extended with one or more
optional instruction-set extensions. An extension may be categorized as
either standard, custom, or non-conforming. For this purpose, we divide
each RISC-V instruction-set encoding space (and related encoding spaces
such as the CSRs) into three disjoint categories: standard,
reserved, and custom. Standard extensions and encodings are defined
by RISC-V International; any extensions not defined by RISC-V
International are non-standard. Each base ISA and its standard
extensions use only standard encodings, and shall not conflict with each
other in their uses of these encodings. Reserved encodings are currently
not defined but are saved for future standard extensions; once thus
used, they become standard encodings. Custom encodings shall never be
used for standard extensions and are made available for vendor-specific
non-standard extensions. Non-standard extensions are either custom
extensions, that use only custom encodings, or non-conforming
extensions, that use any standard or reserved encoding. Instruction-set
extensions are generally shared but may provide slightly different
functionality depending on the base ISA.
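
The categorization rules above can be summarized in a short sketch. The names `Encoding` and `classify_extension` are hypothetical and exist only for this illustration; the logic simply restates the definitions in the preceding paragraph.

```python
from enum import Enum


class Encoding(Enum):
    """The three disjoint categories of encoding space named in the text."""
    STANDARD = "standard"
    RESERVED = "reserved"
    CUSTOM = "custom"


def classify_extension(defined_by_riscv_international, encodings_used):
    """Classify an extension as 'standard', 'custom', or 'non-conforming'.

    Hypothetical helper: encodings_used is the set of Encoding categories
    that the extension's instructions occupy.
    """
    encodings_used = set(encodings_used)
    if defined_by_riscv_international:
        # Custom encodings shall never be used for standard extensions.
        if Encoding.CUSTOM in encodings_used:
            raise ValueError("a standard extension must not use custom encodings")
        return "standard"
    # Non-standard extensions are custom if they stay within custom
    # encodings, and non-conforming if they use any standard or reserved one.
    if encodings_used <= {Encoding.CUSTOM}:
        return "custom"
    return "non-conforming"


print(classify_extension(True, {Encoding.STANDARD}))
print(classify_extension(False, {Encoding.CUSTOM}))
print(classify_extension(False, {Encoding.CUSTOM, Encoding.RESERVED}))
```
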
Chapter [extensions] describes various ways of
extending the RISC-V ISA. We have also developed a naming convention for
RISC-V base instructions and instruction-set extensions, described in
detail in Chapter [naming].
To support more general software development, a set of standard
extensions is defined to provide integer multiply/divide, atomic
operations, and single and double-precision floating-point arithmetic.
The base integer ISA is named “I” (prefixed by RV32 or RV64 depending on
integer register width), and contains integer computational
instructions, integer loads, integer stores, and control-flow
instructions. The standard integer multiplication and division extension
is named “M”, and adds instructions to multiply and divide values held
in the integer registers. The standard atomic instruction extension,
denoted by “A”, adds instructions that atomically read, modify, and
write memory for inter-processor synchronization. The standard
single-precision floating-point extension, denoted by “F”, adds
floating-point registers, single-precision computational instructions,
and single-precision loads and stores. The standard double-precision
floating-point extension, denoted by “D”, expands the floating-point
registers, and adds double-precision computational instructions, loads,
and stores. The standard “C” compressed instruction extension provides
narrower 16-bit forms of common instructions.
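
As a rough illustration of how these names combine, the sketch below splits a string such as RV64IMAFDC into its register width and the extensions listed in this paragraph. It is deliberately minimal: the function name `parse_isa_string` is hypothetical, only the I base and the single-letter extensions named above are handled, and the full naming rules are those of the naming chapter, not this code.

```python
import re

# Only the names introduced in the paragraph above.
EXTENSION_NAMES = {
    "I": "base integer ISA",
    "M": "integer multiplication and division",
    "A": "atomic instructions",
    "F": "single-precision floating-point",
    "D": "double-precision floating-point",
    "C": "compressed instructions",
}


def parse_isa_string(isa):
    """Return (xlen, [descriptions]) for strings like 'RV32IMAC'."""
    match = re.fullmatch(r"RV(32|64)([A-Z]+)", isa.upper())
    if match is None:
        raise ValueError(f"unrecognized ISA string: {isa!r}")
    xlen = int(match.group(1))
    letters = match.group(2)
    if letters[0] != "I":
        raise ValueError("this sketch only handles the I base")
    unknown = [c for c in letters if c not in EXTENSION_NAMES]
    if unknown:
        raise ValueError(f"letters not covered by this sketch: {unknown}")
    return xlen, [EXTENSION_NAMES[c] for c in letters]


xlen, parts = parse_isa_string("RV64IMAFDC")
print(xlen)   # 64
for part in parts:
    print(part)
```
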
Beyond the base integer ISA and these standard extensions, we believe it
is rare that a new instruction will provide a significant benefit for
all applications, although it may be very beneficial for a certain
domain. As energy efficiency concerns are forcing greater
specialization, we believe it is important to simplify the required
portion of an ISA specification. Whereas other architectures usually
treat their ISA as a single entity, which changes to a new version as
instructions are added over time, RISC-V will endeavor to keep the base
and each standard extension constant over time, and instead layer new
instructions as further optional extensions. For example, the base
integer ISAs will continue as fully supported standalone ISAs,
regardless of any subsequent extensions.
Memory

A RISC-V hart has a single byte-addressable address space of
2^XLEN bytes for all memory accesses. A word of memory is
defined as 32 bits (4 bytes). Correspondingly, a halfword is 16 bits
(2 bytes), a doubleword is 64 bits (8 bytes), and a quadword is 128 bits
(16 bytes). The memory address space is circular, so
that the byte at address 2^XLEN − 1 is adjacent to the byte at
address zero. Accordingly, memory address computations done by the
hardware ignore overflow and instead wrap around modulo
2^XLEN.
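As a minimal sketch of the wrap-around rule, the following Python models an address computation that ignores overflow. The function name `effective_address` and the default XLEN of 32 are assumptions made for illustration; they are not defined by this specification.

```python
XLEN = 32  # assumed register width for this sketch (RV32)


def effective_address(base: int, offset: int, xlen: int = XLEN) -> int:
    """Return base + offset wrapped modulo 2**xlen, ignoring overflow."""
    return (base + offset) & ((1 << xlen) - 1)


# The byte at address 2**XLEN - 1 is adjacent to the byte at address zero.
last_byte = (1 << XLEN) - 1
next_after_last = effective_address(last_byte, 1)
before_zero = effective_address(0, -1)
```

Here `next_after_last` is 0 and `before_zero` equals `last_byte`, which is the circular behavior described above.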
The execution environment determines the mapping of hardware resources
into a hart’s address space. Different address ranges of a hart’s
address space may (1) be vacant, or (2) contain main memory, or
(3) contain one or more I/O devices. Reads and writes of I/O devices
may have visible side effects, but accesses to main memory cannot.
Although it is possible for the execution environment to call everything
in a hart’s address space an I/O device, it is usually expected that
some portion will be specified as main memory.
When a RISC-V platform has multiple harts, the address spaces of any two
harts may be entirely the same, or entirely different, or may be partly
different but sharing some subset of resources, mapped into the same or
different address ranges.

For a purely “bare metal” environment, all harts may see an identical
address space, accessed entirely by physical addresses. However, when
the execution environment includes an operating system employing address
translation, it is common for each hart to be given a virtual address
space that is largely or entirely its own.

Executing each RISC-V machine instruction entails one or more memory
accesses, subdivided into implicit and explicit accesses. For each
instruction executed, an implicit memory read (instruction fetch) is
done to obtain the encoded instruction to execute. Many RISC-V
instructions perform no further memory accesses beyond instruction
fetch. Specific load and store instructions perform an explicit read
or write of memory at an address determined by the instruction. The
execution environment may dictate that instruction execution performs
other implicit memory accesses (such as to implement address
translation) beyond those documented for the unprivileged ISA.
The execution environment determines what portions of the non-vacant
address space are accessible for each kind of memory access. For
example, the set of locations that can be implicitly read for
instruction fetch may or may not have any overlap with the set of
locations that can be explicitly read by a load instruction; and the set
of locations that can be explicitly written by a store instruction may
be only a subset of locations that can be read. Ordinarily, if an
instruction attempts to access memory at an inaccessible address, an
exception is raised for the instruction. Vacant locations in the address
space are never accessible.
Except when specified otherwise, implicit reads that do not raise an
exception may occur arbitrarily early and speculatively, even before the
machine could possibly prove that the read will be needed. For instance,
a valid implementation could attempt to read all of main memory at the
earliest opportunity, cache as many fetchable (executable) bytes as
possible for later instruction fetches, and avoid reading main memory
for instruction fetches ever again. To ensure that certain implicit
reads are ordered only after writes to the same memory locations,
software must execute specific fence or cache-control instructions
defined for this purpose (such as the FENCE.I instruction defined in
Chapter [chap:zifencei]).
The memory accesses (implicit or explicit) made by a hart may appear to
occur in a different order as perceived by another hart or by any other
agent that can access the same memory. This perceived reordering of
memory accesses is always constrained, however, by the applicable memory
consistency model. The default memory consistency model for RISC-V is
the RISC-V Weak Memory Ordering (RVWMO), defined in
Chapter [ch:memorymodel] and in appendices.
Optionally, an implementation may adopt the stronger model of Total
Store Ordering, as defined in
Chapter [sec:ztso]. The execution environment may
also add constraints that further limit the perceived reordering of
memory accesses. Since the RVWMO model is the weakest model allowed for
any RISC-V implementation, software written for this model is compatible
with the actual memory consistency rules of all RISC-V implementations.
As with implicit reads, software must execute fence or cache-control
instructions to ensure specific ordering of memory accesses beyond the
requirements of the assumed memory consistency model and execution
environment.
Base Instruction-Length Encoding

The base RISC-V ISA has fixed-length 32-bit instructions that must be
naturally aligned on 32-bit boundaries. However, the standard RISC-V
encoding scheme is designed to support ISA extensions with
variable-length instructions, where each instruction can be any number
of 16-bit instruction parcels in length and parcels are naturally
aligned on 16-bit boundaries. The standard compressed ISA extension
described in Chapter [compressed] reduces code size by
providing compressed 16-bit instructions and relaxes the alignment
constraints to allow all instructions (16 bit and 32 bit) to be aligned
on any 16-bit boundary to improve code density.
We use the term IALIGN (measured in bits) to refer to the
instruction-address alignment constraint the implementation enforces.
IALIGN is 32 bits in the base ISA, but some ISA extensions, including
the compressed ISA extension, relax IALIGN to 16 bits. IALIGN may not
take on any value other than 16 or 32.
We use the term ILEN (measured in bits) to refer to the maximum
instruction length supported by an implementation, which is always a
multiple of IALIGN. For implementations supporting only a base
instruction set, ILEN is 32 bits. Implementations supporting longer
instructions have larger values of ILEN.
Figure [instlengthcode] illustrates the
standard RISC-V instruction-length encoding convention. All the 32-bit
instructions in the base ISA have their lowest two bits set to 11. The
optional compressed 16-bit instruction-set extensions have their lowest
two bits equal to 00, 01, or 10.
Expanded Instruction-Length Encoding

A portion of the 32-bit instruction-encoding space has been tentatively
allocated for instructions longer than 32 bits. The entirety of this
space is reserved at this time, and the following proposal for encoding
instructions longer than 32 bits is not considered frozen.
Standard instruction-set extensions encoded with more than 32 bits have
additional low-order bits set to 1, with the conventions for 48-bit
and 64-bit lengths shown in
Figure [instlengthcode]. Instruction
lengths between 80 bits and 176 bits are encoded using a 3-bit field in
bits [14:12] giving the number of 16-bit words in addition to the
first 5×16-bit words. The encoding with bits [14:12] set to
111 is reserved for future longer instruction encodings.
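The length rules above can be sketched as a small decoder that looks only at the first (lowest-addressed) 16-bit parcel. This is an illustrative sketch: the function name `instruction_length_bits` is hypothetical, and the encodings longer than 32 bits follow the proposal in this section, which is not frozen.

```python
from typing import Optional


def instruction_length_bits(first_parcel: int) -> Optional[int]:
    """Return the instruction length in bits, or None for the reserved
    (>=192-bit) encoding, given the first 16-bit parcel."""
    p = first_parcel & 0xFFFF
    if p & 0b11 != 0b11:            # aa != 11
        return 16
    if p & 0b11100 != 0b11100:      # bbb != 111
        return 32
    if p & 0b111111 == 0b011111:    # low six bits 011111
        return 48
    if p & 0b1111111 == 0b0111111:  # low seven bits 0111111
        return 64
    nnn = (p >> 12) & 0b111         # bits [14:12]
    if nnn != 0b111:
        return 80 + 16 * nnn
    return None                     # reserved for longer encodings
```

The function only classifies length; it does not check whether the encoding is a legal instruction.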


| Byte address base+4 | Byte address base+2 | Byte address base | Instruction length |
|:- |:- |:- |:- |
| | | xxxxxxxxxxxxxxaa | 16-bit (aa ≠ 11) |
| | xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | 32-bit (bbb ≠ 111) |
| ⋅⋅⋅xxxx | xxxxxxxxxxxxxxxx | xxxxxxxxxx011111 | 48-bit |
| ⋅⋅⋅xxxx | xxxxxxxxxxxxxxxx | xxxxxxxxx0111111 | 64-bit |
| ⋅⋅⋅xxxx | xxxxxxxxxxxxxxxx | xnnnxxxxx1111111 | (80+16*nnn)-bit, nnn ≠ 111 |
| ⋅⋅⋅xxxx | xxxxxxxxxxxxxxxx | x111xxxxx1111111 | Reserved for ≥192 bits |


Given the code size and energy savings of a compressed format, we wanted
to build in support for a compressed format to the ISA encoding scheme
rather than adding this as an afterthought, but to allow simpler
implementations we didn’t want to make the compressed format mandatory.
We also wanted to optionally allow longer instructions to support
experimentation and larger instruction-set extensions. Although our
encoding convention required a tighter encoding of the core RISC-V ISA,
this has several beneficial effects.
An implementation of the standard IMAFD ISA need only hold the
most-significant 30 bits in instruction caches (a 6.25% saving). On
instruction cache refills, any instructions encountered with either low
bit clear should be recoded into illegal 30-bit instructions before
storing in the cache to preserve illegal instruction exception behavior.
Perhaps more importantly, by condensing our base ISA into a subset of
the 32-bit instruction word, we leave more space available for
non-standard and custom extensions. In particular, the base RV32I ISA
uses less than 1/8 of the encoding space in the 32-bit instruction word.
As described in Chapter [extensions], an implementation that
does not require support for the standard compressed instruction
extension can map 3 additional non-conforming 30-bit instruction spaces
into the 32-bit fixed-width format, while preserving support for
standard ≥32-bit instruction-set extensions. Further, if the
implementation also does not need instructions >32-bits in
length, it can recover a further four major opcodes for non-conforming
extensions.

Encodings with bits [15:0] all zeros are defined as illegal
instructions. These instructions are considered to be of minimal length:
16 bits if any 16-bit instruction-set extension is present, otherwise 32
bits. The encoding with bits [ILEN-1:0] all ones is also illegal; this
instruction is considered to be ILEN bits long.

We consider it a feature that any length of instruction containing all
zero bits is not legal, as this quickly traps erroneous jumps into
zeroed memory regions. Similarly, we also reserve the instruction
encoding containing all ones to be an illegal instruction, to catch the
other common pattern observed with unprogrammed non-volatile memory
devices, disconnected memory buses, or broken memory devices.
Software can rely on a naturally aligned 32-bit word containing zero to
act as an illegal instruction on all RISC-V implementations, to be used
by software where an illegal instruction is explicitly desired. Defining
a corresponding known illegal value for all ones is more difficult due
to the variable-length encoding. Software cannot generally use the
illegal value of ILEN bits of all 1s, as software might not know ILEN
for the eventual target machine (e.g., if software is compiled into a
standard binary library used by many different machines). Defining a
32-bit word of all ones as illegal was also considered, as all machines
must support a 32-bit instruction size, but this requires the
instruction-fetch unit on machines with ILEN>32 to report an
illegal instruction exception rather than an access-fault exception when
such an instruction borders a protection boundary, complicating
variable-instruction-length fetch and decode.

RISC-V base ISAs have either little-endian or big-endian memory systems,
with the privileged architecture further defining bi-endian operation.
Instructions are stored in memory as a sequence of 16-bit little-endian
parcels, regardless of memory system endianness. Parcels forming one
instruction are stored at increasing halfword addresses, with the
lowest-addressed parcel holding the lowest-numbered bits in the
instruction specification.
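A minimal sketch of the parcel ordering rule, assuming instruction memory is available as a Python `bytes` object. The helper name `assemble_instruction` is hypothetical.

```python
def assemble_instruction(memory: bytes, addr: int, num_parcels: int) -> int:
    """Build an instruction value from 16-bit little-endian parcels stored
    at increasing halfword addresses starting at addr."""
    value = 0
    for i in range(num_parcels):
        low_byte = memory[addr + 2 * i]
        high_byte = memory[addr + 2 * i + 1]
        parcel = low_byte | (high_byte << 8)   # each parcel is little-endian
        value |= parcel << (16 * i)            # lowest address holds the lowest bits
    return value


# Bytes of a 32-bit instruction as they appear in memory, lowest address first.
encoded = bytes([0x13, 0x05, 0x10, 0x00])
inst = assemble_instruction(encoded, 0, 2)
```

Because both the parcels and their order are little-endian, the result matches `int.from_bytes(encoded, "little")`, whatever the endianness of data accesses.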

We originally chose little-endian byte ordering for the RISC-V memory
system because little-endian systems are currently dominant commercially
(all x86 systems; iOS, Android, and Windows for ARM). A minor point is
that we have also found little-endian memory systems to be more natural
for hardware designers. However, certain application areas, such as IP
networking, operate on big-endian data structures, and certain legacy
code bases have been built assuming big-endian processors, so we have
defined big-endian and bi-endian variants of RISC-V.
We have to fix the order in which instruction parcels are stored in
memory, independent of memory system endianness, to ensure that the
length-encoding bits always appear first in halfword address order. This
allows the length of a variable-length instruction to be quickly
determined by an instruction-fetch unit by examining only the first few
bits of the first 16-bit instruction parcel.
We further make the instruction parcels themselves little-endian to
decouple the instruction encoding from the memory system endianness
altogether. This design benefits both software tooling and bi-endian
hardware. Otherwise, for instance, a RISC-V assembler or disassembler
would always need to know the intended active endianness, even though,
in bi-endian systems, the endianness mode might change dynamically
during execution. In contrast, by giving instructions a fixed
endianness, it is sometimes possible for carefully written software to
be endianness-agnostic even in binary form, much like
position-independent code.
The choice to have instructions be only little-endian does have
consequences, however, for RISC-V software that encodes or decodes
machine instructions. Big-endian JIT compilers, for example, must swap
the byte order when storing to instruction memory.
Once we had decided to fix on a little-endian instruction encoding, this
naturally led to placing the length-encoding bits in the LSB positions
of the instruction format to avoid breaking up opcode fields.

Exceptions, Traps, and Interrupts

We use the term exception to refer to an unusual condition occurring
at run time associated with an instruction in the current RISC-V hart.
We use the term interrupt to refer to an external asynchronous event
that may cause a RISC-V hart to experience an unexpected transfer of
control. We use the term trap to refer to the transfer of control to a
trap handler caused by either an exception or an interrupt.
The instruction descriptions in following chapters describe conditions
that can raise an exception during execution. The general behavior of
most RISC-V EEIs is that a trap to some handler occurs when an exception
is signaled on an instruction (except for floating-point exceptions,
which, in the standard floating-point extensions, do not cause traps).
The manner in which interrupts are generated, routed to, and enabled by
a hart depends on the EEI.

Our use of “exception” and “trap” is compatible with that in the
IEEE-754 floating-point standard.

How traps are handled and made visible to software running on the hart
depends on the enclosing execution environment. From the perspective of
software running inside an execution environment, traps encountered by a
hart at runtime can have four different effects:
Contained Trap:

The trap is visible to, and handled by, software running inside the
execution environment. For example, in an EEI providing both supervisor
and user mode on harts, an ECALL by a user-mode hart will generally
result in a transfer of control to a supervisor-mode handler running on
the same hart. Similarly, in the same environment, when a hart is
interrupted, an interrupt handler will be run in supervisor mode on the
hart.
Requested Trap:

The trap is a synchronous exception that is an explicit call to the
execution environment requesting an action on behalf of software inside
the execution environment. An example is a system call. In this case,
execution may or may not resume on the hart after the requested action
is taken by the execution environment. For example, a system call could
remove the hart or cause an orderly termination of the entire execution
environment.
Invisible Trap:

The trap is handled transparently by the execution environment and
execution resumes normally after the trap is handled. Examples include
emulating missing instructions, handling non-resident page faults in a
demand-paged virtual-memory system, or handling device interrupts for a
different job in a multiprogrammed machine. In these cases, the software
running inside the execution environment is not aware of the trap (we
ignore timing effects in these definitions).
Fatal Trap:

The trap represents a fatal failure and causes the execution environment
to terminate execution. Examples include failing a virtual-memory
page-protection check or allowing a watchdog timer to expire. Each EEI
should define how execution is terminated and reported to an external
environment.
Table 1.1 shows the
characteristics of each kind of trap.


| | Contained | Requested | Invisible | Fatal |
|:- |:- |:- |:- |:- |
| Execution terminates | No | No¹ | No | Yes |
| Software is oblivious | No | No | Yes | Yes² |
| Handled by environment | No | Yes | Yes | Yes |

Characteristics of traps. Notes: 1) Termination may be requested. 2)
Imprecise fatal traps might be observable by software.

The EEI defines for each trap whether it is handled precisely, though
the recommendation is to maintain preciseness where possible. Contained
and requested traps can be observed to be imprecise by software inside
the execution environment. Invisible traps, by definition, cannot be
observed to be precise or imprecise by software running inside the
execution environment. Fatal traps can be observed to be imprecise by
software running inside the execution environment, if known-errorful
instructions do not cause immediate termination.
Because this document describes unprivileged instructions, traps are
rarely mentioned. Architectural means to handle contained traps are
defined in the privileged architecture manual, along with other features
to support richer EEIs. Unprivileged instructions that are defined
solely to cause requested traps are documented here. Invisible traps
are, by their nature, out of scope for this document. Instruction
encodings that are not defined here and not defined by some other means
may cause a fatal trap.
UNSPECIFIED Behaviors and Values

The architecture fully describes what implementations must do and any
constraints on what they may do. In cases where the architecture
intentionally does not constrain implementations, the term UNSPECIFIED is
explicitly used.
The term UNSPECIFIED refers to a behavior or value that is intentionally
unconstrained. The definition of these behaviors or values is open to
extensions, platform standards, or implementations. Extensions, platform
standards, or implementation documentation may provide normative content
to further constrain cases that the base architecture defines as UNSPECIFIED.
Like the base architecture, extensions should fully describe allowable
behavior and values and use the term UNSPECIFIED for cases that are intentionally
unconstrained. These cases may be constrained or defined by other
extensions, platform standards, or implementations.
RV32I Base Integer Instruction Set, Version 2.1

This chapter describes the RV32I base integer instruction set.

RV32I was designed to be sufficient to form a compiler target and to
support modern operating system environments. The ISA was also designed
to reduce the hardware required in a minimal implementation. RV32I
contains 40 unique instructions, though a simple implementation might
cover the ECALL/EBREAK instructions with a single SYSTEM hardware
instruction that always traps and might be able to implement the FENCE
instruction as a NOP, reducing base instruction count to 38 total. RV32I
can emulate almost any other ISA extension (except the A extension,
which requires additional hardware support for atomicity).
In practice, a hardware implementation including the machine-mode
privileged architecture will also require the 6 CSR instructions.
Subsets of the base integer ISA might be useful for pedagogical
purposes, but the base has been defined such that there should be little
incentive to subset a real hardware implementation beyond omitting
support for misaligned memory accesses and treating all SYSTEM
instructions as a single trap.


The standard RISC-V assembly language syntax is documented in the
Assembly Programmer’s Manual.


Most of the commentary for RV32I also applies to the RV64I base.

Programmers’ Model for Base Integer ISA

Figure [gprs] shows the unprivileged state for the
base integer ISA. For RV32I, the 32 x registers are each 32 bits wide,
i.e., XLEN=32. Register x0 is hardwired with all bits equal to 0.
General purpose registers x1–x31 hold values that various
instructions interpret as a collection of Boolean values, or as two’s
complement signed binary integers or unsigned binary integers.
There is one additional unprivileged register: the program counter pc
holds the address of the current instruction.


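Figure [gprs]: RISC-V base unprivileged integer register state (registers x0 through x31 and pc, each XLEN bits wide).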


There is no dedicated stack pointer or subroutine return address link
register in the Base Integer ISA; the instruction encoding allows any
x register to be used for these purposes. However, the standard
software calling convention uses register x1 to hold the return
address for a call, with register x5 available as an alternate link
register. The standard calling convention uses register x2 as the
stack pointer.
Hardware might choose to accelerate function calls and returns that use
x1 or x5. See the descriptions of the JAL and JALR instructions.
The optional compressed 16-bit instruction format is designed around the
assumption that x1 is the return address register and x2 is the
stack pointer. Software using other conventions will operate correctly
but may have greater code size.


The number of available architectural registers can have large impacts
on code size, performance, and energy consumption. Although 16 registers
would arguably be sufficient for an integer ISA running compiled code,
it is impossible to encode a complete ISA with 16 registers in 16-bit
instructions using a 3-address format. Although a 2-address format would
be possible, it would increase instruction count and lower efficiency.
We wanted to avoid intermediate instruction sizes (such as Xtensa’s
24-bit instructions) to simplify base hardware implementations, and once
a 32-bit instruction size was adopted, it was straightforward to support
32 integer registers. A larger number of integer registers also helps
performance on high-performance code, where there can be extensive use
of loop unrolling, software pipelining, and cache tiling.
For these reasons, we chose a conventional size of 32 integer registers
for RV32I. Dynamic register usage tends to be dominated by a few
frequently accessed registers, and regfile implementations can be
optimized to reduce access energy for the frequently accessed
registers. The optional compressed 16-bit instruction format mostly
only accesses 8 registers and hence can provide a dense instruction
encoding, while additional instruction-set extensions could support a
much larger register space (either flat or hierarchical) if desired.
For resource-constrained embedded applications, we have defined the
RV32E subset, which only has 16 registers
(Chapter [rv32e]).

Base Instruction Formats

In the base RV32I ISA, there are four core instruction formats
(R/I/S/U), as shown in
Figure [fig:baseinstformats]. All are
a fixed 32 bits in length. The base ISA has IALIGN=32, meaning that
instructions must be aligned on a four-byte boundary in memory. An
instruction-address-misaligned exception is generated on a taken branch
or unconditional jump if the target address is not IALIGN-bit aligned.
This exception is reported on the branch or jump instruction, not on the
target instruction. No instruction-address-misaligned exception is
generated for a conditional branch that is not taken.

The alignment constraint for base ISA instructions is relaxed to a
two-byte boundary when instruction extensions with 16-bit lengths or
other odd multiples of 16-bit lengths are added (i.e., IALIGN=16).
Instruction-address-misaligned exceptions are reported on the branch or
jump that would cause instruction misalignment to help debugging, and to
simplify hardware design for systems with IALIGN=32, where these are the
only places where misalignment can occur.

The behavior upon decoding a reserved instruction is UNSPECIFIED.

Some platforms may require that opcodes reserved for standard use raise
an illegal-instruction exception. Other platforms may permit reserved
opcode space to be used for non-conforming extensions.


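| Format | Fields, high bit to low bit (instruction bit positions in parentheses) |
|:- |:- |
| R-type | funct7 (31:25), rs2 (24:20), rs1 (19:15), funct3 (14:12), rd (11:7), opcode (6:0) |
| I-type | imm[11:0] (31:20), rs1 (19:15), funct3 (14:12), rd (11:7), opcode (6:0) |
| S-type | imm[11:5] (31:25), rs2 (24:20), rs1 (19:15), funct3 (14:12), imm[4:0] (11:7), opcode (6:0) |
| U-type | imm[31:12] (31:12), rd (11:7), opcode (6:0) |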


The RISC-V ISA keeps the source (rs1 and rs2) and destination (rd)
registers at the same position in all formats to simplify decoding.
Except for the 5-bit immediates used in CSR instructions
(Chapter [csrinsts]), immediates are always
sign-extended, and are generally packed towards the leftmost available
bits in the instruction and have been allocated to reduce hardware
complexity. In particular, the sign bit for all immediates is always in
bit 31 of the instruction to speed sign-extension circuitry.
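Because the register specifiers sit at fixed positions, a decoder can extract them before it knows the instruction format. The following Python is a minimal sketch of that idea; the function name `decode_fields` is hypothetical, and the bit positions are those of the standard 32-bit layout (opcode in bits 6:0, rd in 11:7, funct3 in 14:12, rs1 in 19:15, rs2 in 24:20, funct7 in 31:25).

```python
def decode_fields(inst: int) -> dict:
    """Extract the fixed-position fields of a 32-bit instruction word.
    Fields a given format does not use simply hold immediate bits."""
    return {
        "opcode": inst & 0x7F,
        "rd": (inst >> 7) & 0x1F,
        "funct3": (inst >> 12) & 0x7,
        "rs1": (inst >> 15) & 0x1F,
        "rs2": (inst >> 20) & 0x1F,
        "funct7": (inst >> 25) & 0x7F,
    }


# An R-type word with rd = x5, rs1 = x6, rs2 = x7 and opcode 0110011.
example = (7 << 20) | (6 << 15) | (5 << 7) | 0b0110011
fields = decode_fields(example)
```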

Decoding register specifiers is usually on the critical paths in
implementations, and so the instruction format was chosen to keep all
register specifiers at the same position in all formats at the expense
of having to move immediate bits across formats (a property shared with
RISC-IV, a.k.a. SPUR).
In practice, most immediates are either small or require all XLEN bits.
We chose an asymmetric immediate split (12 bits in regular instructions
plus a special load-upper-immediate instruction with 20 bits) to
increase the opcode space available for regular instructions.
Immediates are sign-extended because we did not observe a benefit to
using zero-extension for some immediates as in the MIPS ISA and wanted
to keep the ISA as simple as possible.

Immediate Encoding Variants

There are a further two variants of the instruction formats (B/J) based
on the handling of immediates, as shown in
Figure [fig:baseinstformatsimm].


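| Format | Fields, high bit to low bit (instruction bit positions in parentheses) |
|:- |:- |
| R-type | funct7 (31:25), rs2 (24:20), rs1 (19:15), funct3 (14:12), rd (11:7), opcode (6:0) |
| I-type | imm[11:0] (31:20), rs1 (19:15), funct3 (14:12), rd (11:7), opcode (6:0) |
| S-type | imm[11:5] (31:25), rs2 (24:20), rs1 (19:15), funct3 (14:12), imm[4:0] (11:7), opcode (6:0) |
| B-type | imm[12] (31), imm[10:5] (30:25), rs2 (24:20), rs1 (19:15), funct3 (14:12), imm[4:1] (11:8), imm[11] (7), opcode (6:0) |
| U-type | imm[31:12] (31:12), rd (11:7), opcode (6:0) |
| J-type | imm[20] (31), imm[10:1] (30:21), imm[11] (20), imm[19:12] (19:12), rd (11:7), opcode (6:0) |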


The only difference between the S and B formats is that the 12-bit
immediate field is used to encode branch offsets in multiples of 2 in
the B format. Instead of shifting all bits in the instruction-encoded
immediate left by one in hardware as is conventionally done, the middle
bits (imm[10:1]) and sign bit stay in fixed positions, while the
lowest bit in S format (inst[7]) encodes a high-order bit in B format.
Similarly, the only difference between the U and J formats is that the
20-bit immediate is shifted left by 12 bits to form U immediates and by
1 bit to form J immediates. The location of instruction bits in the U
and J format immediates is chosen to maximize overlap with the other
formats and with each other.
Figure [fig:immtypes] shows the immediates
produced by each of the base instruction formats, and is labeled to show
which instruction bit (inst[y]) produces each bit of the immediate
value.


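| Immediate | Bits produced, high to low |
|:- |:- |
| I-immediate | imm[31:11] = inst[31] (replicated), imm[10:5] = inst[30:25], imm[4:1] = inst[24:21], imm[0] = inst[20] |
| S-immediate | imm[31:11] = inst[31] (replicated), imm[10:5] = inst[30:25], imm[4:1] = inst[11:8], imm[0] = inst[7] |
| B-immediate | imm[31:12] = inst[31] (replicated), imm[11] = inst[7], imm[10:5] = inst[30:25], imm[4:1] = inst[11:8], imm[0] = 0 |
| U-immediate | imm[31] = inst[31], imm[30:20] = inst[30:20], imm[19:12] = inst[19:12], imm[11:0] = 0 |
| J-immediate | imm[31:20] = inst[31] (replicated), imm[19:12] = inst[19:12], imm[11] = inst[20], imm[10:5] = inst[30:25], imm[4:1] = inst[24:21], imm[0] = 0 |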


Sign-extension is one of the most critical operations on immediates
(particularly for XLEN>32), and in RISC-V the sign bit for all
immediates is always held in bit 31 of the instruction to allow
sign-extension to proceed in parallel with instruction decoding.
Although more complex implementations might have separate adders for
branch and jump calculations and so would not benefit from keeping the
location of immediate bits constant across types of instruction, we
wanted to reduce the hardware cost of the simplest implementations. By
rotating bits in the instruction encoding of B and J immediates instead
of using dynamic hardware muxes to multiply the immediate by 2, we
reduce instruction signal fanout and immediate mux costs by around a
factor of 2. The scrambled immediate encoding will add negligible time
to static or ahead-of-time compilation. For dynamic generation of
instructions, there is some small additional overhead, but the most
common short forward branches have straightforward immediate encodings.
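For software that decodes instructions, the immediate formats can be sketched as follows. This is an illustrative Python model, not a reference decoder; the helper names (`sign_extend`, `field`, `imm_i`, and so on) are hypothetical, and results are returned as signed Python integers.

```python
def sign_extend(value: int, width: int) -> int:
    """Interpret the low `width` bits of value as a two's complement number."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)


def field(inst: int, hi: int, lo: int) -> int:
    """Return instruction bits inst[hi:lo] as an unsigned integer."""
    return (inst >> lo) & ((1 << (hi - lo + 1)) - 1)


def imm_i(inst: int) -> int:
    return sign_extend(field(inst, 31, 20), 12)


def imm_s(inst: int) -> int:
    return sign_extend((field(inst, 31, 25) << 5) | field(inst, 11, 7), 12)


def imm_b(inst: int) -> int:
    raw = ((field(inst, 31, 31) << 12) | (field(inst, 7, 7) << 11)
           | (field(inst, 30, 25) << 5) | (field(inst, 11, 8) << 1))
    return sign_extend(raw, 13)  # bit 0 is always zero


def imm_u(inst: int) -> int:
    return sign_extend(field(inst, 31, 12) << 12, 32)


def imm_j(inst: int) -> int:
    raw = ((field(inst, 31, 31) << 20) | (field(inst, 19, 12) << 12)
           | (field(inst, 20, 20) << 11) | (field(inst, 30, 21) << 1))
    return sign_extend(raw, 21)  # bit 0 is always zero
```

In every format the sign comes from inst[31], so a hardware decoder can start sign-extension before it knows which format applies.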

Integer Computational Instructions

Most integer computational instructions operate on XLEN bits of values
held in the integer register file. Integer computational instructions
are either encoded as register-immediate operations using the I-type
format or as register-register operations using the R-type format. The
destination is register rd for both register-immediate and
register-register instructions. No integer computational instructions
cause arithmetic exceptions.

We did not include special instruction-set support for overflow checks
on integer arithmetic operations in the base instruction set, as many
overflow checks can be cheaply implemented using RISC-V branches.
Overflow checking for unsigned addition requires only a single
additional branch instruction after the addition:
 add t0, t1, t2; bltu t0, t1, overflow.
For signed addition, if one operand’s sign is known, overflow checking
requires only a single branch after the addition:
 addi t0, t1, +imm; blt t0, t1, overflow. This covers the common case
of addition with an immediate operand.
For general signed addition, three additional instructions after the
addition are required, leveraging the observation that the sum should be
less than one of the operands if and only if the other operand is
negative.
         add t0, t1, t2
         slti t3, t2, 0
         slt t4, t0, t1
         bne t3, t4, overflow

In RV64I, checks of 32-bit signed additions can be optimized further by
comparing the results of ADD and ADDW on the operands.
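The branch-based checks above can be modeled in Python to confirm that they detect overflow. This is a sketch under the assumption XLEN=32; the function names are hypothetical, and each line is annotated with the instruction it models.

```python
XLEN = 32  # assumed register width for this sketch
MASK = (1 << XLEN) - 1


def to_signed(x: int) -> int:
    """Interpret the low XLEN bits of x as a two's complement integer."""
    x &= MASK
    return x - (1 << XLEN) if x >> (XLEN - 1) else x


def unsigned_add_overflows(t1: int, t2: int) -> bool:
    t0 = (t1 + t2) & MASK             # add  t0, t1, t2
    return t0 < (t1 & MASK)           # bltu t0, t1, overflow


def signed_add_overflows(t1: int, t2: int) -> bool:
    t0 = to_signed(t1 + t2)           # add  t0, t1, t2
    t3 = int(to_signed(t2) < 0)       # slti t3, t2, 0
    t4 = int(t0 < to_signed(t1))      # slt  t4, t0, t1
    return t3 != t4                   # bne  t3, t4, overflow
```

The signed check relies on the observation stated above: the sum is less than one operand exactly when the other operand is negative, unless the addition overflowed.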

Integer Register-Immediate Instructions


| 31:20 | 19:15 | 14:12 | 11:7 | 6:0 |
|:- |:- |:- |:- |:- |
| imm[11:0] | rs1 | funct3 | rd | opcode |
| 12 | 5 | 3 | 5 | 7 |
| I-immediate[11:0] | src | ADDI/SLTI[U] | dest | OP-IMM |
| I-immediate[11:0] | src | ANDI/ORI/XORI | dest | OP-IMM |

ADDI adds the sign-extended 12-bit immediate to register rs1.
Arithmetic overflow is ignored and the result is simply the low XLEN
bits of the result. ADDI rd, rs1, 0 is used to implement the MV rd,
rs1 assembler pseudoinstruction.
SLTI (set less than immediate) places the value 1 in register rd if
register rs1 is less than the sign-extended immediate when both are
treated as signed numbers, else 0 is written to rd. SLTIU is similar
but compares the values as unsigned numbers (i.e., the immediate is
first sign-extended to XLEN bits then treated as an unsigned number).
Note, SLTIU rd, rs1, 1 sets rd to 1 if rs1 equals zero, otherwise
sets rd to 0 (assembler pseudoinstruction SEQZ rd, rs).
ANDI, ORI, XORI are logical operations that perform bitwise AND, OR, and
XOR on register rs1 and the sign-extended 12-bit immediate and place
the result in rd. Note, XORI rd, rs1, -1 performs a bitwise logical
inversion of register rs1 (assembler pseudoinstruction NOT rd, rs).
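The semantics of these register-immediate instructions can be sketched in Python as below. This is an illustrative model assuming XLEN=32, with register values held as unsigned integers; the function names mirror the mnemonics but are otherwise hypothetical.

```python
XLEN = 32  # assumed register width for this sketch
MASK = (1 << XLEN) - 1


def sext12(imm: int) -> int:
    """Sign-extend a 12-bit immediate to a signed Python integer."""
    imm &= 0xFFF
    return imm - 0x1000 if imm & 0x800 else imm


def to_signed(x: int) -> int:
    x &= MASK
    return x - (1 << XLEN) if x >> (XLEN - 1) else x


def addi(rs1: int, imm: int) -> int:
    return (rs1 + sext12(imm)) & MASK      # overflow ignored, low XLEN bits kept


def slti(rs1: int, imm: int) -> int:
    return int(to_signed(rs1) < sext12(imm))


def sltiu(rs1: int, imm: int) -> int:
    # The immediate is sign-extended first, then compared as unsigned.
    return int((rs1 & MASK) < (sext12(imm) & MASK))


def andi(rs1: int, imm: int) -> int:
    return (rs1 & sext12(imm)) & MASK


def ori(rs1: int, imm: int) -> int:
    return (rs1 | sext12(imm)) & MASK


def xori(rs1: int, imm: int) -> int:
    return (rs1 ^ sext12(imm)) & MASK
```

With this model, `sltiu(x, 1)` behaves as SEQZ and `xori(x, -1)` as NOT, matching the pseudoinstructions mentioned above.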


| 31:25 | 24:20 | 19:15 | 14:12 | 11:7 | 6:0 |
|:- |:- |:- |:- |:- |:- |
| imm[11:5] | imm[4:0] | rs1 | funct3 | rd | opcode |
| 7 | 5 | 5 | 3 | 5 | 7 |
| 0000000 | shamt[4:0] | src | SLLI | dest | OP-IMM |
| 0000000 | shamt[4:0] | src | SRLI | dest | OP-IMM |
| 0100000 | shamt[4:0] | src | SRAI | dest | OP-IMM |


Shifts by a constant are encoded as a specialization of the I-type
format. The operand to be shifted is in rs1, and the shift amount is
encoded in the lower 5 bits of the I-immediate field. The right shift
type is encoded in bit 30. SLLI is a logical left shift (zeros are
shifted into the lower bits); SRLI is a logical right shift (zeros are
shifted into the upper bits); and SRAI is an arithmetic right shift (the
original sign bit is copied into the vacated upper bits).
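A minimal Python model of the three constant shifts, assuming XLEN=32 and unsigned register values. The function names mirror the mnemonics and are otherwise hypothetical.

```python
XLEN = 32  # assumed register width for this sketch
MASK = (1 << XLEN) - 1


def to_signed(x: int) -> int:
    x &= MASK
    return x - (1 << XLEN) if x >> (XLEN - 1) else x


def slli(rs1: int, shamt: int) -> int:
    """Logical left shift: zeros enter the lower bits."""
    return (rs1 << (shamt & 0x1F)) & MASK


def srli(rs1: int, shamt: int) -> int:
    """Logical right shift: zeros enter the upper bits."""
    return (rs1 & MASK) >> (shamt & 0x1F)


def srai(rs1: int, shamt: int) -> int:
    """Arithmetic right shift: the sign bit fills the vacated upper bits."""
    return (to_signed(rs1) >> (shamt & 0x1F)) & MASK
```

Python's `>>` on a negative integer already replicates the sign, which is why `srai` converts to a signed value before shifting.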


| 31:12 | 11:7 | 6:0 |
|:- |:- |:- |
| imm[31:12] | rd | opcode |
| 20 | 5 | 7 |
| U-immediate[31:12] | dest | LUI |
| U-immediate[31:12] | dest | AUIPC |


LUI (load upper immediate) is used to build 32-bit constants and uses
the U-type format. LUI places the 32-bit U-immediate value into the
destination register rd, filling in the lowest 12 bits with zeros.
AUIPC (add upper immediate to pc) is used to build pc-relative
addresses and uses the U-type format. AUIPC forms a 32-bit offset from
the U-immediate, filling in the lowest 12 bits with zeros, adds this
offset to the address of the AUIPC instruction, then places the result
in register rd.

The assembly syntax for lui and auipc does not represent the lower
12 bits of the U-immediate, which are always zero.
The AUIPC instruction supports two-instruction sequences to access
arbitrary offsets from the pc for both control-flow transfers and data
accesses. The combination of an AUIPC and the 12-bit immediate in a JALR
can transfer control to any 32-bit pc-relative address, while an AUIPC
plus the 12-bit immediate offset in regular load or store instructions
can access any 32-bit pc-relative data address.
The current pc can be obtained by setting the U-immediate to 0.
Although a JAL +4 instruction could also be used to obtain the local
pc (of the instruction following the JAL), it might cause pipeline
breaks in simpler microarchitectures or pollute branch-target buffer
structures in more complex microarchitectures.
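
Because the 12-bit immediate of the second instruction is sign-extended, the upper 20 bits used with AUIPC (or LUI) must be rounded up whenever bit 11 of the offset is set. The following Python sketch shows one way to split a 32-bit offset, assuming RV32; `split_hi_lo` and `rejoin` are hypothetical helper names.

```python
MASK32 = 0xFFFFFFFF


def sext12(value):
    """Sign-extend a 12-bit field."""
    value &= 0xFFF
    return value - 0x1000 if value & 0x800 else value


def split_hi_lo(offset):
    """Split a 32-bit offset into a U-immediate and a 12-bit immediate."""
    offset &= MASK32
    lo12 = offset & 0xFFF
    # Adding 0x800 compensates for the sign extension of lo12.
    hi20 = ((offset + 0x800) >> 12) & 0xFFFFF
    return hi20, lo12


def rejoin(pc, hi20, lo12):
    """Address formed by AUIPC (pc + hi20 << 12) plus the 12-bit immediate."""
    return (pc + (hi20 << 12) + sext12(lo12)) & MASK32
```

For example, the offset 0x12345FFF splits into hi20 = 0x12346 and lo12 = 0xFFF (that is, -1), and 0x12346000 - 1 gives back the original offset.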

Integer Register-Register Operations

RV32I defines several arithmetic R-type operations. All operations read
the rs1 and rs2 registers as source operands and write the result
into register rd. The funct7 and funct3 fields select the type of
operation.


| 31:25 | 24:20 | 19:15 | 14:12 | 11:7 | 6:0 |
|:- |:- |:- |:- |:- |:- |
| funct7 | rs2 | rs1 | funct3 | rd | opcode |
| 7 | 5 | 5 | 3 | 5 | 7 |
| 0000000 | src2 | src1 | ADD/SLT[U] | dest | OP |
| 0000000 | src2 | src1 | AND/OR/XOR | dest | OP |
| 0000000 | src2 | src1 | SLL/SRL | dest | OP |
| 0100000 | src2 | src1 | SUB/SRA | dest | OP |


ADD performs the addition of rs1 and rs2. SUB performs the
subtraction of rs2 from rs1. Overflows are ignored and the low XLEN
bits of results are written to the destination rd. SLT and SLTU
perform signed and unsigned compares respectively, writing 1 to rd if
rs1 < rs2, 0 otherwise. Note, SLTU rd, x0, rs2 sets rd to 1 if
rs2 is not equal to zero, otherwise sets rd to zero (assembler
pseudoinstruction SNEZ rd, rs). AND, OR, and XOR perform bitwise
logical operations.
SLL, SRL, and SRA perform logical left, logical right, and arithmetic
right shifts on the value in register rs1 by the shift amount held in
the lower 5 bits of register rs2.
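
A short Python sketch of the register-register semantics, assuming 32-bit registers; the function names are illustrative. It highlights the two details that are easy to miss: comparisons differ only in how the operands are interpreted, and shifts use only the low 5 bits of rs2.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit register value as a signed number."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def sub(rs1, rs2):
    return (rs1 - rs2) & MASK32          # overflow ignored


def slt(rs1, rs2):
    return int(to_signed(rs1) < to_signed(rs2))


def sltu(rs1, rs2):
    return int((rs1 & MASK32) < (rs2 & MASK32))


def sra(rs1, rs2):
    # Only the low 5 bits of rs2 select the shift amount.
    return (to_signed(rs1) >> (rs2 & 31)) & MASK32
```

With rs1 fixed to zero, `sltu(0, x)` is 1 exactly when x is nonzero, which is the SNEZ idiom.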

NOP Instruction


| 31:20 | 19:15 | 14:12 | 11:7 | 6:0 |
|:- |:- |:- |:- |:- |
| imm[11:0] | rs1 | funct3 | rd | opcode |
| 12 | 5 | 3 | 5 | 7 |
| 0 | 0 | ADDI | 0 | OP-IMM |


The NOP instruction does not change any architecturally visible state,
except for advancing the pc and incrementing any applicable
performance counters. NOP is encoded as ADDI x0, x0, 0.

NOPs can be used to align code segments to microarchitecturally
significant address boundaries, or to leave space for inline code
modifications. Although there are many possible ways to encode a NOP, we
define a canonical NOP encoding to allow microarchitectural
optimizations as well as for more readable disassembly output. The other
NOP encodings are made available for HINT instructions
(Section 1.9).
ADDI was chosen for the NOP encoding as this is most likely to take
fewest resources to execute across a range of systems (if not optimized
away in decode). In particular, the instruction only reads one register.
Also, an ADDI functional unit is more likely to be available in a
superscalar design as adds are the most common operation. In particular,
address-generation functional units can execute ADDI using the same
hardware needed for base+offset address calculations, while
register-register ADD or logical/shift operations require additional
hardware.
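
The canonical NOP encoding can be derived by packing the I-type fields. The sketch below is a minimal Python illustration; the OP-IMM opcode value (0010011) and the ADDI funct3 value (000) come from the base opcode listings, which lie outside this section, and `encode_i_type` is a hypothetical helper name.

```python
OP_IMM = 0b0010011   # major opcode for register-immediate operations
FUNCT3_ADDI = 0b000


def encode_i_type(imm12, rs1, funct3, rd, opcode):
    """Pack the I-type fields into a 32-bit instruction word."""
    return (((imm12 & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12)
            | (rd << 7) | opcode)


# ADDI x0, x0, 0
NOP = encode_i_type(0, 0, FUNCT3_ADDI, 0, OP_IMM)

# ADDI x10, x11, 0, i.e. MV x10, x11
MV_X10_X11 = encode_i_type(0, 11, FUNCT3_ADDI, 10, OP_IMM)
```

Here `NOP` evaluates to 0x00000013, the word a disassembler prints as `nop`.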

Control Transfer Instructions

RV32I provides two types of control transfer instructions: unconditional
jumps and conditional branches. Control transfer instructions in RV32I
do not have architecturally visible delay slots.
If an instruction access-fault or instruction page-fault exception
occurs on the target of a jump or taken branch, the exception is
reported on the target instruction, not on the jump or branch
instruction.

Unconditional Jumps

The jump and link (JAL) instruction uses the J-type format, where the
J-immediate encodes a signed offset in multiples of 2 bytes. The offset
is sign-extended and added to the address of the jump instruction to
form the jump target address. Jumps can therefore target a ±1 MiB range. JAL
stores the address of the instruction that follows the JAL (pc+4) into
register rd. The standard software calling convention uses x1 as the
return address register and x5 as an alternate link register.

The alternate link register supports calling millicode routines (e.g.,
those to save and restore registers in compressed code) while preserving
the regular return address register. The register x5 was chosen as the
alternate link register as it maps to a temporary in the standard
calling convention, and has an encoding that is only one bit different
than the regular link register.

Plain unconditional jumps (assembler pseudoinstruction J) are encoded as
a JAL with rd=x0.


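| 31 | 30:21 | 20 | 19:12 | 11:7 | 6:0 |
|:- |:- |:- |:- |:- |:- |
| imm[20] | imm[10:1] | imm[11] | imm[19:12] | rd | opcode |
| 1 | 10 | 1 | 8 | 5 | 7 |
| offset[20] | offset[10:1] | offset[11] | offset[19:12] | dest | JAL |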
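
The J-immediate is stored in a scrambled bit order (imm[20], imm[10:1], imm[11], imm[19:12] from bit 31 down to bit 12). The following Python sketch packs and unpacks it; the JAL opcode value 1101111 is taken from the base opcode listings, and the function names are illustrative.

```python
JAL_OPCODE = 0b1101111


def encode_jal(offset, rd):
    """Encode JAL rd, offset; offset is a byte offset from the JAL itself."""
    if offset % 2 != 0 or not -(1 << 20) <= offset < (1 << 20):
        raise ValueError("offset must be even and within +/-1 MiB")
    imm = offset & 0x1FFFFF
    return ((((imm >> 20) & 0x1) << 31)
            | (((imm >> 1) & 0x3FF) << 21)
            | (((imm >> 11) & 0x1) << 20)
            | (((imm >> 12) & 0xFF) << 12)
            | (rd << 7)
            | JAL_OPCODE)


def decode_jal_offset(insn):
    """Recover the signed byte offset from a JAL instruction word."""
    imm = ((((insn >> 31) & 0x1) << 20)
           | (((insn >> 12) & 0xFF) << 12)
           | (((insn >> 20) & 0x1) << 11)
           | (((insn >> 21) & 0x3FF) << 1))
    return imm - (1 << 21) if imm & (1 << 20) else imm
```

As a check, `encode_jal(8, 1)` gives 0x008000EF (`jal x1, 8`), and a plain backward jump `encode_jal(-4, 0)` gives 0xFFDFF06F.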


The indirect jump instruction JALR (jump and link register) uses the
I-type encoding. The target address is obtained by adding the
sign-extended 12-bit I-immediate to the register rs1, then setting the
least-significant bit of the result to zero. The address of the
instruction following the jump (pc+4) is written to register rd.
Register x0 can be used as the destination if the result is not
required.


| 31:20 | 19:15 | 14:12 | 11:7 | 6:0 |
|:- |:- |:- |:- |:- |
| imm[11:0] | rs1 | funct3 | rd | opcode |
| 12 | 5 | 3 | 5 | 7 |
| offset[11:0] | base | 0 | dest | JALR |
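
The JALR target computation can be sketched as follows, assuming 32-bit addresses. The function returns both the jump target and the link value written to rd; the name `jalr` and its argument names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def sext12(value):
    """Sign-extend a 12-bit field."""
    value &= 0xFFF
    return value - 0x1000 if value & 0x800 else value


def jalr(pc, rs1_value, imm12):
    """Return (target, link) for JALR executed at address pc."""
    # Add the sign-extended immediate, then clear the least-significant bit.
    target = (rs1_value + sext12(imm12)) & MASK32 & ~1
    link = (pc + 4) & MASK32
    return target, link
```

A base value of 0x2001 with a zero immediate therefore jumps to 0x2000: the low bit is dropped rather than causing a misaligned target.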


The unconditional jump instructions all use pc-relative addressing to
help support position-independent code. The JALR instruction was defined
to enable a two-instruction sequence to jump anywhere in a 32-bit
absolute address range. A LUI instruction can first load rs1 with the
upper 20 bits of a target address, then JALR can add in the lower bits.
Similarly, AUIPC then JALR can jump anywhere in a 32-bit pc-relative
address range.
Note that the JALR instruction does not treat the 12-bit immediate as
multiples of 2 bytes, unlike the conditional branch instructions. This
avoids one more immediate format in hardware. In practice, most uses of
JALR will have either a zero immediate or be paired with a LUI or AUIPC,
so the slight reduction in range is not significant.
Clearing the least-significant bit when calculating the JALR target
address both simplifies the hardware slightly and allows the low bit of
function pointers to be used to store auxiliary information. Although
there is potentially a slight loss of error checking in this case, in
practice jumps to an incorrect instruction address will usually quickly
raise an exception.
When used with a base rs1=x0, JALR can be used to implement a single
instruction subroutine call to the lowest or highest address region from
anywhere in the address space, which could be used to implement fast
calls to a small runtime library. Alternatively, an ABI could dedicate a
general-purpose register to point to a library elsewhere in the address
space.

The JAL and JALR instructions will generate an
instruction-address-misaligned exception if the target address is not
aligned to an IALIGN-bit boundary.

Instruction-address-misaligned exceptions are not possible on machines
with IALIGN=16, such as those that support the compressed
instruction-set extension, C.

Return-address prediction stacks are a common feature of
high-performance instruction-fetch units, but require accurate detection
of instructions used for procedure calls and returns to be effective.
For RISC-V, hints as to the instructions’ usage are encoded implicitly
via the register numbers used. A JAL instruction should push the return
address onto a return-address stack (RAS) only when rd is x1 or
x5. JALR instructions should push/pop a RAS as shown in
Table 1.1.


| rd is x1/x5 | rs1 is x1/x5 | rd=rs1 | RAS action |
|:- |:- |:- |:- |
| No | No | – | None |
| No | Yes | – | Pop |
| Yes | No | – | Push |
| Yes | Yes | No | Pop, then push |
| Yes | Yes | Yes | Push |

Return-address stack prediction hints encoded in the register operands
of a JALR instruction.
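
The hint table for JALR translates directly into a small decision function. The Python sketch below is illustrative; `jalr_ras_action` is a hypothetical name and the returned strings simply mirror the table's "RAS action" column.

```python
LINK_REGISTERS = {1, 5}   # x1 and x5


def jalr_ras_action(rd, rs1):
    """Return the RAS action implied by the rd and rs1 register numbers."""
    rd_is_link = rd in LINK_REGISTERS
    rs1_is_link = rs1 in LINK_REGISTERS
    if not rd_is_link and not rs1_is_link:
        return "none"
    if not rd_is_link:
        return "pop"
    if not rs1_is_link:
        return "push"
    # Both are link registers: the action depends on whether they match.
    return "push" if rd == rs1 else "pop, then push"
```

A function return written as `jalr x0, 0(x1)` maps to a pop, and an indirect call `jalr x1, 0(x10)` maps to a push.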


Some other ISAs added explicit hint bits to their indirect-jump
instructions to guide return-address stack manipulation. We use implicit
hinting tied to register numbers and the calling convention to reduce
the encoding space used for these hints.
When two different link registers (x1 and x5) are given as rs1 and
rd, then the RAS is both popped and pushed to support coroutines. If
rs1 and rd are the same link register (either x1 or x5), the RAS
is only pushed to enable macro-op fusion of the sequences
lui ra, imm20; jalr ra, imm12(ra) and
auipc ra, imm20; jalr ra, imm12(ra).

Conditional Branches

All branch instructions use the B-type instruction format. The 12-bit
B-immediate encodes signed offsets in multiples of 2 bytes. The offset
is sign-extended and added to the address of the branch instruction to
give the target address. The conditional branch range is ±4 KiB.


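| 31 | 30:25 | 24:20 | 19:15 | 14:12 | 11:8 | 7 | 6:0 |
|:- |:- |:- |:- |:- |:- |:- |:- |
| imm[12] | imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11] | opcode |
| 1 | 6 | 5 | 5 | 3 | 4 | 1 | 7 |
| offset[12] | offset[10:5] | src2 | src1 | BEQ/BNE | offset[4:1] | offset[11] | BRANCH |
| offset[12] | offset[10:5] | src2 | src1 | BLT[U] | offset[4:1] | offset[11] | BRANCH |
| offset[12] | offset[10:5] | src2 | src1 | BGE[U] | offset[4:1] | offset[11] | BRANCH |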


Branch instructions compare two registers. BEQ and BNE take the branch
if registers rs1 and rs2 are equal or unequal respectively. BLT and
BLTU take the branch if rs1 is less than rs2, using signed and
unsigned comparison respectively. BGE and BGEU take the branch if rs1
is greater than or equal to rs2, using signed and unsigned comparison
respectively. Note, BGT, BGTU, BLE, and BLEU can be synthesized by
reversing the operands to BLT, BLTU, BGE, and BGEU, respectively.

Signed array bounds may be checked with a single BLTU instruction, since
any negative index will compare greater than any nonnegative bound.
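
The bounds-check trick works because a negative two's-complement index has its top bit set and so reads as a very large unsigned number. The following Python sketch models the unsigned comparison that BLTU performs on 32-bit registers; `in_bounds` is a hypothetical name, and the bound is assumed to be nonnegative.

```python
MASK32 = 0xFFFFFFFF


def in_bounds(index, bound):
    """Model `BLTU index, bound`: true when 0 <= index < bound.

    `index` may be any signed 32-bit value; `bound` is assumed nonnegative.
    """
    return (index & MASK32) < (bound & MASK32)
```

With a bound of 10, the indices 0 and 9 pass, while both 10 and -1 fail the single comparison.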

Software should be optimized such that the sequential code path is the
most common path, with less-frequently taken code paths placed out of
line. Software should also assume that backward branches will be
predicted taken and forward branches as not taken, at least the first
time they are encountered. Dynamic predictors should quickly learn any
predictable branch behavior.
Unlike some other architectures, the RISC-V jump (JAL with rd=x0)
instruction should always be used for unconditional branches instead of
a conditional branch instruction with an always-true condition. RISC-V
jumps are also pc-relative and support a much wider offset range than
branches, and will not pollute conditional-branch prediction tables.

The conditional branches were designed to include arithmetic comparison
operations between two registers (as also done in PA-RISC, Xtensa, and
MIPS R6), rather than use condition codes (x86, ARM, SPARC, PowerPC), or
to only compare one register against zero (Alpha, MIPS), or two
registers only for equality (MIPS). This design was motivated by the
observation that a combined compare-and-branch instruction fits into a
regular pipeline, avoids additional condition code state or use of a
temporary register, and reduces static code size and dynamic instruction
fetch traffic. Another point is that comparisons against zero require
non-trivial circuit delay (especially after the move to static logic in
advanced processes) and so are almost as expensive as arithmetic
magnitude compares. Another advantage of a fused compare-and-branch
instruction is that branches are observed earlier in the front-end
instruction stream, and so can be predicted earlier. There is perhaps an
advantage to a design with condition codes in the case where multiple
branches can be taken based on the same condition codes, but we believe
this case to be relatively rare.
We considered but did not include static branch hints in the instruction
encoding. These can reduce the pressure on dynamic predictors, but
require more instruction encoding space and software profiling for best
results, and can result in poor performance if production runs do not
match profiling runs.
We considered but did not include conditional moves or predicated
instructions, which can effectively replace unpredictable short forward
branches. Conditional moves are the simpler of the two, but are
difficult to use with conditional code that might cause exceptions
(memory accesses and floating-point operations). Predication adds
additional flag state to a system, additional instructions to set and
clear flags, and additional encoding overhead on every instruction. Both
conditional move and predicated instructions add complexity to
out-of-order microarchitectures, adding an implicit third source operand
due to the need to copy the original value of the destination
architectural register into the renamed destination physical register if
the predicate is false. Also, static compile-time decisions to use
predication instead of branches can result in lower performance on
inputs not included in the compiler training set, especially given that
unpredictable branches are rare, and becoming rarer as branch prediction
techniques improve.
We note that various microarchitectural techniques exist to dynamically
convert unpredictable short forward branches into internally predicated
code to avoid the cost of flushing pipelines on a branch mispredict and
have been implemented in commercial processors. The simplest techniques
just reduce the penalty of recovering from a mispredicted short forward
branch by only flushing instructions in the branch shadow instead of the
entire fetch pipeline, or by fetching instructions from both sides using
wide instruction fetch or idle instruction fetch slots. More complex
techniques for out-of-order cores add internal predicates on
instructions in the branch shadow, with the internal predicate value
written by the branch instruction, allowing the branch and following
instructions to be executed speculatively and out-of-order with respect
to other code.

The conditional branch instructions will generate an
instruction-address-misaligned exception if the target address is not
aligned to an IALIGN-bit boundary and the branch condition evaluates to
true. If the branch condition evaluates to false, the
instruction-address-misaligned exception will not be raised.

Instruction-address-misaligned exceptions are not possible on machines
with IALIGN=16, such as those that support the compressed
instruction-set extension, C.

Load and Store Instructions

RV32I is a load-store architecture, where only load and store
instructions access memory and arithmetic instructions only operate on
CPU registers. RV32I provides a 32-bit address space that is
byte-addressed. The EEI will define what portions of the address space
are legal to access with which instructions (e.g., some addresses might
be read only, or support word access only). Loads with a destination of
x0 must still raise any exceptions and cause any other side effects
even though the load value is discarded.
The EEI will define whether the memory system is little-endian or
big-endian. In RISC-V, endianness is byte-address invariant.

In a system for which endianness is byte-address invariant, the
following property holds: if a byte is stored to memory at some address
in some endianness, then a byte-sized load from that address in any
endianness returns the stored value.
In a little-endian configuration, multibyte stores write the
least-significant register byte at the lowest memory byte address,
followed by the other register bytes in ascending order of their
significance. Loads similarly transfer the contents of the lesser memory
byte addresses to the less-significant register bytes.
In a big-endian configuration, multibyte stores write the
most-significant register byte at the lowest memory byte address,
followed by the other register bytes in descending order of their
significance. Loads similarly transfer the contents of the greater
memory byte addresses to the less-significant register bytes.
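
The two byte orders, and the byte-address-invariance property, can be illustrated with Python's built-in integer/bytes conversions. This is a sketch over a `bytearray` standing in for memory; `store` and `load` are hypothetical helpers.

```python
def store(mem, addr, value, size, byteorder):
    """Store the low `size` bytes of `value` at `addr` in the given order."""
    value &= (1 << (8 * size)) - 1
    mem[addr:addr + size] = value.to_bytes(size, byteorder)


def load(mem, addr, size, byteorder):
    """Load `size` bytes from `addr`, zero-extended."""
    return int.from_bytes(mem[addr:addr + size], byteorder)


mem = bytearray(8)
store(mem, 0, 0x11223344, 4, "little")   # bytes 44 33 22 11 at addresses 0..3
store(mem, 4, 0x11223344, 4, "big")      # bytes 11 22 33 44 at addresses 4..7
```

A single-byte store followed by a single-byte load from the same address returns the same value regardless of the `byteorder` argument, which is the invariance property described above.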


| 31:20 | 19:15 | 14:12 | 11:7 | 6:0 |
|:- |:- |:- |:- |:- |
| imm[11:0] | rs1 | funct3 | rd | opcode |
| 12 | 5 | 3 | 5 | 7 |
| offset[11:0] | base | width | dest | LOAD |


| 31:25 | 24:20 | 19:15 | 14:12 | 11:7 | 6:0 |
|:- |:- |:- |:- |:- |:- |
| imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode |
| 7 | 5 | 5 | 3 | 5 | 7 |
| offset[11:5] | src | base | width | offset[4:0] | STORE |


Load and store instructions transfer a value between the registers and
memory. Loads are encoded in the I-type format and stores are S-type.
The effective address is obtained by adding register rs1 to the
sign-extended 12-bit offset. Loads copy a value from memory to register
rd. Stores copy the value in register rs2 to memory.
The LW instruction loads a 32-bit value from memory into rd. LH loads
a 16-bit value from memory, then sign-extends to 32 bits before storing
in rd. LHU loads a 16-bit value from memory but then zero-extends to
32 bits before storing in rd. LB and LBU are defined analogously for
8-bit values. The SW, SH, and SB instructions store 32-bit, 16-bit, and
8-bit values from the low bits of register rs2 to memory.
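
The difference between the sign-extending and zero-extending loads can be sketched in Python as follows. The sketch assumes a little-endian memory held in a `bytes` object; all function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def _load(mem, addr, size, signed):
    """Load `size` bytes (little-endian assumed) and extend to 32 bits."""
    raw = int.from_bytes(mem[addr:addr + size], "little")
    if signed and raw >> (8 * size - 1):
        raw -= 1 << (8 * size)           # sign-extend
    return raw & MASK32


def lb(mem, addr):
    return _load(mem, addr, 1, signed=True)


def lbu(mem, addr):
    return _load(mem, addr, 1, signed=False)


def lh(mem, addr):
    return _load(mem, addr, 2, signed=True)


def lhu(mem, addr):
    return _load(mem, addr, 2, signed=False)


def lw(mem, addr):
    return _load(mem, addr, 4, signed=False)
```

For memory contents 80 FF 00 00 at address 0, `lb` yields 0xFFFFFF80 while `lbu` yields 0x00000080.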
Regardless of EEI, loads and stores whose effective addresses are
naturally aligned shall not raise an address-misaligned exception. Loads
and stores whose effective address is not naturally aligned to the
referenced datatype (i.e., the effective address is not divisible by the
size of the access in bytes) have behavior dependent on the EEI.
An EEI may guarantee that misaligned loads and stores are fully
supported, and so the software running inside the execution environment
will never experience a contained or fatal address-misaligned trap. In
this case, the misaligned loads and stores can be handled in hardware,
or via an invisible trap into the execution environment implementation,
or possibly a combination of hardware and invisible trap depending on
address.
An EEI may not guarantee misaligned loads and stores are handled
invisibly. In this case, loads and stores that are not naturally aligned
may either complete execution successfully or raise an exception. The
exception raised can be either an address-misaligned exception or an
access-fault exception. For a memory access that would otherwise be able
to complete except for the misalignment, an access-fault exception can
be raised instead of an address-misaligned exception if the misaligned
access should not be emulated, e.g., if accesses to the memory region
have side effects. When an EEI does not guarantee misaligned loads and
stores are handled invisibly, the EEI must define if exceptions caused
by address misalignment result in a contained trap (allowing software
running inside the execution environment to handle the trap) or a fatal
trap (terminating execution).

Misaligned accesses are occasionally required when porting legacy code,
and help performance on applications when using any form of packed-SIMD
extension or handling externally packed data structures. Our rationale
for allowing EEIs to choose to support misaligned accesses via the
regular load and store instructions is to simplify the addition of
misaligned hardware support. One option would have been to disallow
misaligned accesses in the base ISAs and then provide some separate ISA
support for misaligned accesses, either special instructions to help
software handle misaligned accesses or a new hardware addressing mode
for misaligned accesses. Special instructions are difficult to use,
complicate the ISA, and often add new processor state (e.g., SPARC VIS
align address offset register) or complicate access to existing
processor state (e.g., MIPS LWL/LWR partial register writes). In
addition, for loop-oriented packed-SIMD code, the extra overhead when
operands are misaligned motivates software to provide multiple forms of
loop depending on operand alignment, which complicates code generation
and adds to loop startup overhead. New misaligned hardware addressing
modes take considerable space in the instruction encoding or require
very simplified addressing modes (e.g., register indirect only).

Even when misaligned loads and stores complete successfully, these
accesses might run extremely slowly depending on the implementation
(e.g., when implemented via an invisible trap). Furthermore, whereas
naturally aligned loads and stores are guaranteed to execute atomically,
misaligned loads and stores might not, and hence require additional
synchronization to ensure atomicity.

We do not mandate atomicity for misaligned accesses so execution
environment implementations can use an invisible machine trap and a
software handler to handle some or all misaligned accesses. If hardware
misaligned support is provided, software can exploit this by simply
using regular load and store instructions. Hardware can then
automatically optimize accesses depending on whether runtime addresses
are aligned.

Memory Ordering Instructions


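| 31:28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19:15 | 14:12 | 11:7 | 6:0 |
|:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |
| fm | PI | PO | PR | PW | SI | SO | SR | SW | rs1 | funct3 | rd | opcode |
| 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 5 | 3 | 5 | 7 |
| FM | predecessor | | | | successor | | | | 0 | FENCE | 0 | MISC-MEM |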


The FENCE instruction is used to order device I/O and memory accesses as
viewed by other RISC-V harts and external devices or coprocessors. Any
combination of device input (I), device output (O), memory reads (R),
and memory writes (W) may be ordered with respect to any combination of
the same. Informally, no other RISC-V hart or external device can
observe any operation in the successor set following a FENCE before
any operation in the predecessor set preceding the FENCE.
Chapter [ch:memorymodel] provides a precise
description of the RISC-V memory consistency model.
The FENCE instruction also orders memory reads and writes made by the
hart as observed by memory reads and writes made by an external device.
However, FENCE does not order observations of events made by an external
device using any other signaling mechanism.

A device might observe an access to a memory location via some external
communication mechanism, e.g., a memory-mapped control register that
drives an interrupt signal to an interrupt controller. This
communication is outside the scope of the FENCE ordering mechanism and
hence the FENCE instruction can provide no guarantee on when a change in
the interrupt signal is visible to the interrupt controller. Specific
devices might provide additional ordering guarantees to reduce software
overhead but those are outside the scope of the RISC-V memory model.

The EEI will define what I/O operations are possible, and in particular,
which memory addresses when accessed by load and store instructions will
be treated and ordered as device input and device output operations
respectively rather than memory reads and writes. For example,
memory-mapped I/O devices will typically be accessed with uncached loads
and stores that are ordered using the I and O bits rather than the R and
W bits. Instruction-set extensions might also describe new I/O
instructions that will also be ordered using the I and O bits in a
FENCE.


| fm field | Mnemonic | Meaning |
|:- |:- |:- |
| 0000 | none | Normal Fence |
| 1000 | TSO | With FENCE RW,RW: exclude write-to-read ordering. Otherwise: Reserved for future use. |
| other | | Reserved for future use. |

Fence mode encoding.


The fence mode field fm defines the semantics of the FENCE. A FENCE
with fm=0000 orders all memory operations in its predecessor set
before all memory operations in its successor set.
The FENCE.TSO instruction is encoded as a FENCE instruction with
fm=1000, predecessor=RW, and successor=RW. FENCE.TSO orders all
load operations in its predecessor set before all memory operations in
its successor set, and all store operations in its predecessor set
before all store operations in its successor set. This leaves non-AMO
store operations in the FENCE.TSO’s predecessor set unordered with
non-AMO loads in its successor set.

Because FENCE RW,RW imposes a superset of the orderings that FENCE.TSO
imposes, it is correct to ignore the fm field and implement FENCE.TSO
as FENCE RW,RW.
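
The FENCE fields can be packed as shown in the following Python sketch. The predecessor and successor sets are each four bits ordered I, O, R, W from high to low; the MISC-MEM opcode value 0001111 is taken from the base opcode listings, rs1, funct3 and rd are left at zero, and the helper names are hypothetical.

```python
MISC_MEM = 0b0001111
ACCESS_BITS = {"i": 0b1000, "o": 0b0100, "r": 0b0010, "w": 0b0001}


def access_set(letters):
    """Convert a string such as "rw" or "iorw" to a 4-bit pred/succ field."""
    bits = 0
    for letter in letters.lower():
        bits |= ACCESS_BITS[letter]
    return bits


def encode_fence(pred, succ, fm=0b0000):
    """Encode FENCE pred, succ with rs1 = rd = x0 and funct3 = 000."""
    return ((fm << 28) | (access_set(pred) << 24)
            | (access_set(succ) << 20) | MISC_MEM)


FENCE_RW_RW = encode_fence("rw", "rw")               # 0x0330000F
FENCE_TSO = encode_fence("rw", "rw", fm=0b1000)      # 0x8330000F
```

The two words differ only in the fm field, which is why an implementation that ignores fm executes FENCE.TSO as the stronger FENCE RW,RW.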

The unused fields in the FENCE instructions—rs1 and rd—are reserved
for finer-grain fences in future extensions. For forward compatibility,
base implementations shall ignore these fields, and standard software
shall zero these fields. Likewise, many fm and predecessor/successor
set settings in Table 1.2 are also reserved for future use. Base
implementations shall treat all such reserved configurations as normal
fences with fm=0000, and standard software shall use only non-reserved
configurations.

We chose a relaxed memory model to allow high performance from simple
machine implementations and from likely future coprocessor or
accelerator extensions. We separate out I/O ordering from memory R/W
ordering to avoid unnecessary serialization within a device-driver hart
and also to support alternative non-memory paths to control added
coprocessors or I/O devices. Simple implementations may additionally
ignore the predecessor and successor fields and always execute a
conservative fence on all operations.

Environment Call and Breakpoints

SYSTEM instructions are used to access system functionality that might
require privileged access and are encoded using the I-type instruction
format. These can be divided into two main classes: those that
atomically read-modify-write control and status registers (CSRs), and
all other potentially privileged instructions. CSR instructions are
described in Chapter [csrinsts], and the base unprivileged
instructions are described in the following section.

The SYSTEM instructions are defined to allow simpler implementations to
always trap to a single software trap handler. More sophisticated
implementations might execute more of each system instruction in
hardware.


| 31:20 | 19:15 | 14:12 | 11:7 | 6:0 |
|:- |:- |:- |:- |:- |
| funct12 | rs1 | funct3 | rd | opcode |
| 12 | 5 | 3 | 5 | 7 |
| ECALL | 0 | PRIV | 0 | SYSTEM |
| EBREAK | 0 | PRIV | 0 | SYSTEM |


These two instructions cause a precise requested trap to the supporting
execution environment.
The ECALL instruction is used to make a service request to the execution
environment. The EEI will define how parameters for the service request
are passed, but usually these will be in defined locations in the
integer register file.
The EBREAK instruction is used to return control to a debugging
environment.

ECALL and EBREAK were previously named SCALL and SBREAK. The
instructions have the same functionality and encoding, but were renamed
to reflect that they can be used more generally than to call a
supervisor-level operating system or debugger.


EBREAK was primarily designed to be used by a debugger to cause
execution to stop and fall back into the debugger. EBREAK is also used
by the standard gcc compiler to mark code paths that should not be
executed.
Another use of EBREAK is to support “semihosting”, where the execution
environment includes a debugger that can provide services over an
alternate system call interface built around the EBREAK instruction.
Because the RISC-V base ISAs do not provide more than one EBREAK
instruction, RISC-V semihosting uses a special sequence of instructions
to distinguish a semihosting EBREAK from a debugger inserted EBREAK.
    slli x0, x0, 0x1f   # Entry NOP
    ebreak              # Break to debugger
    srai x0, x0, 7      # NOP encoding the semihosting call number 7

Note that these three instructions must be 32-bit-wide instructions,
i.e., they mustn’t be among the compressed 16-bit instructions described
in Chapter [compressed].
The shift NOP instructions are still considered available for use as
HINTs.
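
A debugger or emulator can recognize the semihosting sequence by comparing the 32-bit words around an EBREAK. The Python sketch below derives the three encodings from the I-type layout; the opcode and funct3 values come from the base opcode listings, and `is_semihosting_ebreak` is a hypothetical helper, not part of any debug standard.

```python
OP_IMM = 0b0010011
SYSTEM = 0b1110011


def encode_i_type(imm12, rs1, funct3, rd, opcode):
    """Pack the I-type fields into a 32-bit instruction word."""
    return (((imm12 & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12)
            | (rd << 7) | opcode)


ENTRY_NOP = encode_i_type(0x01F, 0, 0b001, 0, OP_IMM)       # slli x0, x0, 0x1f
EBREAK = encode_i_type(0x001, 0, 0b000, 0, SYSTEM)          # ebreak
EXIT_NOP = encode_i_type(0x400 | 7, 0, 0b101, 0, OP_IMM)    # srai x0, x0, 7

SEMIHOSTING_SEQUENCE = (ENTRY_NOP, EBREAK, EXIT_NOP)


def is_semihosting_ebreak(words, index):
    """True if words[index] is an EBREAK wrapped by the two marker NOPs."""
    if index < 1 or index + 1 >= len(words):
        return False
    return tuple(words[index - 1:index + 2]) == SEMIHOSTING_SEQUENCE
```

The three words come out as 0x01F01013, 0x00100073 and 0x40705013; an EBREAK without the surrounding shift NOPs is treated as an ordinary breakpoint.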
Semihosting is a form of service call and would be more naturally
encoded as an ECALL using an existing ABI, but this would require the
debugger to be able to intercept ECALLs, which is a newer addition to
the debug standard. We intend to move over to using ECALLs with a
standard ABI, in which case, semihosting can share a service ABI with an
existing standard.
We note that ARM processors have also moved to using SVC instead of BKPT
for semihosting calls in newer designs.

HINT Instructions

RV32I reserves a large encoding space for HINT instructions, which are
usually used to communicate performance hints to the microarchitecture.
Like the NOP instruction, HINTs do not change any architecturally
visible state, except for advancing the pc and any applicable
performance counters. Implementations are always allowed to ignore the
encoded hints.
Most RV32I HINTs are encoded as integer computational instructions with
rd=x0. The other RV32I HINTs are encoded as FENCE instructions with
a null predecessor or successor set and with fm=0.

These HINT encodings have been chosen so that simple implementations can
ignore HINTs altogether, and instead execute a HINT as a regular
instruction that happens not to mutate the architectural state. For
example, ADD is a HINT if the destination register is x0; the five-bit
rs1 and rs2 fields encode arguments to the HINT. However, a simple
implementation can simply execute the HINT as an ADD of rs1 and rs2
that writes x0, which has no architecturally visible effect.
As another example, a FENCE instruction with a zero pred field and a
zero fm field is a HINT; the succ, rs1, and rd fields encode the
arguments to the HINT. A simple implementation can simply execute the
HINT as a FENCE that orders the null set of prior memory accesses before
whichever subsequent memory accesses are encoded in the succ field.
Since the intersection of the predecessor and successor sets is null,
the instruction imposes no memory orderings, and so it has no
architecturally visible effect.

Table [tab:rv32i-hints] lists all RV32I
HINT code points. 91% of the HINT space is reserved for standard HINTs.
The remainder of the HINT space is designated for custom HINTs: no
standard HINTs will ever be defined in this subspace.

We anticipate standard hints to eventually include memory-system spatial
and temporal locality hints, branch prediction hints, thread-scheduling
hints, security tags, and instrumentation flags for
simulation/emulation.


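| Instruction | Constraints | Code Points | Purpose |
|:- |:- |:- |:- |
| LUI | rd=x0 | 2²⁰ | Reserved for future standard use |
| AUIPC | rd=x0 | 2²⁰ | Reserved for future standard use |
| ADDI | rd=x0, and either rs1≠x0 or imm≠0 | 2¹⁷ − 1 | Reserved for future standard use |
| ANDI | rd=x0 | 2¹⁷ | Reserved for future standard use |
| ORI | rd=x0 | 2¹⁷ | Reserved for future standard use |
| XORI | rd=x0 | 2¹⁷ | Reserved for future standard use |
| ADD | rd=x0, rs1≠x0 | 2¹⁰ − 32 | Reserved for future standard use |
| ADD | rd=x0, rs1=x0, rs2≠x2–x5 | 28 | Reserved for future standard use |
| ADD | rd=x0, rs1=x0, rs2=x2–x5 | 4 | (rs2=x2) NTL.P1; (rs2=x3) NTL.PALL; (rs2=x4) NTL.S1; (rs2=x5) NTL.ALL |
| SUB | rd=x0 | 2¹⁰ | Reserved for future standard use |
| AND | rd=x0 | 2¹⁰ | Reserved for future standard use |
| OR | rd=x0 | 2¹⁰ | Reserved for future standard use |
| XOR | rd=x0 | 2¹⁰ | Reserved for future standard use |
| SLL | rd=x0 | 2¹⁰ | Reserved for future standard use |
| SRL | rd=x0 | 2¹⁰ | Reserved for future standard use |
| SRA | rd=x0 | 2¹⁰ | Reserved for future standard use |
| FENCE | rd=x0, rs1≠x0, fm=0, and either pred=0 or succ=0 | 2¹⁰ − 63 | Reserved for future standard use |
| FENCE | rd≠x0, rs1=x0, fm=0, and either pred=0 or succ=0 | 2¹⁰ − 63 | Reserved for future standard use |
| FENCE | rd=rs1=x0, fm=0, pred=0, succ≠0 | 15 | Reserved for future standard use |
| FENCE | rd=rs1=x0, fm=0, pred≠W, succ=0 | 15 | Reserved for future standard use |
| FENCE | rd=rs1=x0, fm=0, pred=W, succ=0 | 1 | PAUSE |
| SLTI | rd=x0 | 2¹⁷ | Designated for custom use |
| SLTIU | rd=x0 | 2¹⁷ | Designated for custom use |
| SLLI | rd=x0 | 2¹⁰ | Designated for custom use |
| SRLI | rd=x0 | 2¹⁰ | Designated for custom use |
| SRAI | rd=x0 | 2¹⁰ | Designated for custom use |
| SLT | rd=x0 | 2¹⁰ | Designated for custom use |
| SLTU | rd=x0 | 2¹⁰ | Designated for custom use |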
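
The code-point counts for the constrained rows can be checked by brute-force enumeration. The Python sketch below counts the encodings matched by the "rd=x0, and either rs1≠x0 or imm≠0" constraint (the ADDI encodings other than the canonical NOP) and by the five constraint sets that involve fm, pred and succ (FENCE encodings with fm=0). The function names and dictionary keys are illustrative.

```python
from itertools import product

W = 0b0001   # the W bit within a 4-bit pred/succ field (order I, O, R, W)


def count_addi_hints():
    """ADDI with rd=x0, excluding the canonical NOP (rs1=x0, imm=0)."""
    return sum(1 for rs1, imm in product(range(32), range(1 << 12))
               if rs1 != 0 or imm != 0)


def count_fence_hints():
    """Count fm=0 FENCE encodings per constraint row of the HINT table."""
    counts = {"rd=x0, rs1!=x0": 0, "rd!=x0, rs1=x0": 0,
              "pred=0, succ!=0": 0, "pred!=W, succ=0": 0, "pred=W, succ=0": 0}
    for rd, rs1, pred, succ in product(range(32), range(32),
                                       range(16), range(16)):
        if pred != 0 and succ != 0:
            continue                      # not a HINT: both sets non-empty
        if rd == 0 and rs1 != 0:
            counts["rd=x0, rs1!=x0"] += 1
        elif rd != 0 and rs1 == 0:
            counts["rd!=x0, rs1=x0"] += 1
        elif rd == 0 and rs1 == 0:
            if pred == 0 and succ != 0:
                counts["pred=0, succ!=0"] += 1
            elif pred == W:
                counts["pred=W, succ=0"] += 1
            else:
                counts["pred!=W, succ=0"] += 1
    return counts
```

The enumeration gives 2¹⁷ − 1 for the ADDI row, 961 (that is, 2¹⁰ − 63) for each of the first two FENCE rows, and 15, 15 and 1 for the remaining three.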


# “Zifencei” Instruction-Fetch Fence, Version 2.0

This chapter defines the “Zifencei” extension, which includes the
FENCE.I instruction that provides explicit synchronization between
writes to instruction memory and instruction fetches on the same hart.
Currently, this instruction is the only standard mechanism to ensure
that stores visible to a hart will also be visible to its instruction
fetches.

We considered but did not include a “store instruction word” instruction
(as in MAJC). JIT compilers may generate a large trace of instructions
before a single FENCE.I, and amortize any instruction cache
snooping/invalidation overhead by writing translated instructions to
memory regions that are known not to reside in the I-cache.


The FENCE.I instruction was designed to support a wide variety of
implementations. A simple implementation can flush the local instruction
cache and the instruction pipeline when the FENCE.I is executed. A more
complex implementation might snoop the instruction (data) cache on every
data (instruction) cache miss, or use an inclusive unified private L2
cache to invalidate lines from the primary instruction cache when they
are being written by a local store instruction. If instruction and data
caches are kept coherent in this way, or if the memory system consists
of only uncached RAMs, then just the fetch pipeline needs to be flushed
at a FENCE.I.

The FENCE.I instruction was previously part of the base I instruction
set. Two main issues are driving moving this out of the mandatory base,
although at time of writing it is still the only standard method for
maintaining instruction-fetch coherence.

First, it has been recognized that on some systems, FENCE.I will be
expensive to implement and alternate mechanisms are being discussed in
the memory model task group. In particular, for designs that have an
incoherent instruction cache and an incoherent data cache, or where the
instruction cache refill does not snoop a coherent data cache, both
caches must be completely flushed when a FENCE.I instruction is
encountered. This problem is exacerbated when there are multiple levels
of I and D cache in front of a unified cache or outer memory system.

Second, the instruction is not powerful enough to make available at user
level in a Unix-like operating system environment. The FENCE.I only
synchronizes the local hart, and the OS can reschedule the user hart to
a different physical hart after the FENCE.I. This would require the OS
to execute an additional FENCE.I as part of every context migration. For
this reason, the standard Linux ABI has removed FENCE.I from user-level
and now requires a system call to maintain instruction-fetch coherence,
which allows the OS to minimize the number of FENCE.I executions
required on current systems and provides forward-compatibility with
future improved instruction-fetch coherence mechanisms.

Future approaches to instruction-fetch coherence under discussion
include providing more restricted versions of FENCE.I that only target a
given address specified in rs1, and/or allowing software to use an ABI
that relies on machine-mode cache-maintenance operations.


| imm[11:0] | rs1 | funct3 | rd | opcode |
|:- |:- |:- |:- |:- |
| 12 | 5 | 3 | 5 | 7 |
| 0 | 0 | FENCE.I | 0 | MISC-MEM |


The FENCE.I instruction is used to synchronize the instruction and data
streams. RISC-V does not guarantee that stores to instruction memory
will be made visible to instruction fetches on a RISC-V hart until that
hart executes a FENCE.I instruction. A FENCE.I instruction ensures that
a subsequent instruction fetch on a RISC-V hart will see any previous
data stores already visible to the same RISC-V hart. FENCE.I does not
ensure that other RISC-V harts’ instruction fetches will observe the
local hart’s stores in a multiprocessor system. To make a store to
instruction memory visible to all RISC-V harts, the writing hart also
has to execute a data FENCE before requesting that all remote RISC-V
harts execute a FENCE.I.

The unused fields in the FENCE.I instruction, imm[11:0], rs1, and
rd, are reserved for finer-grain fences in future extensions. For
forward compatibility, base implementations shall ignore these fields,
and standard software shall zero these fields.
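As a minimal sketch of this encoding, the following Python assembles a FENCE.I word with the reserved fields zeroed, and matches it the way a base implementation might, by ignoring imm, rs1, and rd. The numeric values used for the MISC-MEM opcode and the FENCE.I funct3 are assumptions taken from the standard opcode map rather than from this chapter, and the function names are illustrative only.

```python
MISC_MEM = 0b0001111       # major opcode value (assumed from the standard opcode map)
FUNCT3_FENCE_I = 0b001     # funct3 value for FENCE.I (same assumption)


def encode_fence_i(imm=0, rs1=0, rd=0):
    """Assemble an I-type word: imm[11:0] | rs1 | funct3 | rd | opcode."""
    return (
        ((imm & 0xFFF) << 20)
        | ((rs1 & 0x1F) << 15)
        | (FUNCT3_FENCE_I << 12)
        | ((rd & 0x1F) << 7)
        | MISC_MEM
    )


def is_fence_i(word):
    """Match on funct3 and opcode only, so imm, rs1, and rd are ignored."""
    return (word & 0x0000707F) == ((FUNCT3_FENCE_I << 12) | MISC_MEM)


# Standard software zeroes the reserved fields.
canonical = encode_fence_i()
```

Under these assumptions `canonical` is the word standard software would emit, while `is_fence_i` still accepts words whose reserved fields are nonzero.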

Because FENCE.I only orders stores with a hart’s own instruction
fetches, application code should only rely upon FENCE.I if the
application thread will not be migrated to a different hart. The EEI can
provide mechanisms for efficient multiprocessor instruction-stream
synchronization.

# “Zihintntl” Non-Temporal Locality Hints, Version 0.2
Warning! This draft specification may change before being accepted as
standard by RISC-V International.

The NTL instructions are HINTs that indicate that the explicit memory
accesses of the immediately subsequent instruction (henceforth “target
instruction”) exhibit poor temporal locality of reference. The NTL
instructions do not change architectural state, nor do they alter the
architecturally visible effects of the target instruction. Four variants
are provided:

- The NTL.P1 instruction indicates that the target instruction does not
  exhibit temporal locality within the capacity of the innermost level of
  private cache in the memory hierarchy. NTL.P1 is encoded as
  ADD x0, x0, x2.
- The NTL.PALL instruction indicates that the target instruction does not
  exhibit temporal locality within the capacity of any level of private
  cache in the memory hierarchy. NTL.PALL is encoded as ADD x0, x0, x3.
- The NTL.S1 instruction indicates that the target instruction does not
  exhibit temporal locality within the capacity of the innermost level of
  shared cache in the memory hierarchy. NTL.S1 is encoded as
  ADD x0, x0, x4.
- The NTL.ALL instruction indicates that the target instruction does not
  exhibit temporal locality within the capacity of any level of cache in
  the memory hierarchy. NTL.ALL is encoded as ADD x0, x0, x5.
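The four encodings can be sketched in Python as follows. The R-type field layout and the OP opcode value are assumptions taken from the base ISA rather than from this chapter, and the helper names are hypothetical.

```python
OP = 0b0110011  # major opcode for register-register ops (assumed from the base ISA)

# rs2 register number that selects each NTL variant, per the text above.
NTL_RS2 = {"NTL.P1": 2, "NTL.PALL": 3, "NTL.S1": 4, "NTL.ALL": 5}


def encode_add(rd, rs1, rs2):
    """R-type ADD: funct7=0 | rs2 | rs1 | funct3=0 | rd | opcode."""
    return (rs2 << 20) | (rs1 << 15) | (rd << 7) | OP


def encode_ntl(name):
    """Encode an NTL hint as ADD x0, x0, <rs2>."""
    return encode_add(0, 0, NTL_RS2[name])


def decode_ntl(word):
    """Return the NTL mnemonic for a 32-bit word, or None if it is not one."""
    for name, rs2 in NTL_RS2.items():
        if word == encode_add(0, 0, rs2):
            return name
    return None
```

A decoder that does not implement the hints can treat all four words as ordinary ADDs to x0, which leaves architectural state unchanged.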

The NTL instructions can be used to avoid cache pollution when streaming
data or traversing large data structures, or to reduce latency in
producer-consumer interactions.

A microarchitecture might use the NTL instructions to inform the cache
replacement policy, or to decide which cache to allocate into, or to
avoid cache allocation altogether. For example, NTL.P1 might indicate
that an implementation should not allocate a line in a private L1 cache,
but should allocate in L2 (whether private or shared). In another
implementation, NTL.P1 might allocate the line in L1, but in the
least-recently used state.

NTL.ALL will typically inform implementations not to allocate anywhere
in the cache hierarchy. Programmers should use NTL.ALL for accesses that
have no exploitable temporal locality.

Like any HINTs, these instructions may be freely ignored. Hence,
although they are described in terms of cache-based memory hierarchies,
they do not mandate the provision of caches.

Some implementations might respect these HINTs for some memory accesses
but not others: e.g., implementations that implement LR/SC by acquiring
a cache line in the exclusive state in L1 might ignore NTL instructions
on LR and SC, but might respect NTL instructions for AMOs and regular
loads and stores.

Table 1.1 lists several software use
cases and the recommended NTL variant that portable software—i.e.,
software not tuned for any specific implementation’s memory
hierarchy—should use in each case.


| Scenario | Recommended NTL variant |
|:- |:- |
| Access to a working set between 64 KiB and 256 KiB in size | NTL.P1 |
| Access to a working set between 256 KiB and 1 MiB in size | NTL.PALL |
| Access to a working set greater than 1 MiB in size | NTL.S1 |
| Access with no exploitable temporal locality (e.g., streaming) | NTL.ALL |
| Access to a contended synchronization variable | NTL.PALL |

Recommended NTL variant for portable software to employ in various
scenarios.


The working-set sizes listed in
Table 1.1 are not meant to constrain
implementers’ cache-sizing decisions. Cache sizes will obviously vary
between implementations, and so software writers should only take these
working-set sizes as rough guidelines.

Table [tab:ntl] lists several sample memory
hierarchies and recommends how each NTL variant maps onto each cache
level. The table also recommends which NTL variant that
implementation-tuned software should use to avoid allocating in a
particular cache level. For example, for a system with a private L1 and
a shared L2, it is recommended that NTL.P1 and NTL.PALL indicate that
temporal locality cannot be exploited by the L1, and that NTL.S1 and
NTL.ALL indicate that temporal locality cannot be exploited by the L2.
Furthermore, software tuned for such a system should use NTL.P1 to
indicate a lack of temporal locality exploitable by the L1, or should
use NTL.ALL to indicate a lack of temporal locality exploitable by the L2.


If the C extension is provided, compressed variants of these HINTs are
also provided: C.NTL.P1 is encoded as C.ADD x0, x2; C.NTL.PALL is
encoded as C.ADD x0, x3; C.NTL.S1 is encoded as C.ADD x0, x4; and
C.NTL.ALL is encoded as C.ADD x0, x5.

The NTL instructions affect all memory-access instructions except the
cache-management instructions in the Zicbom extension.

As of this writing, there are no other exceptions to this rule, and so
the NTL instructions affect all memory-access instructions defined in
the base ISAs and the A, F, D, Q, C, and V standard extensions, as well
as those defined within the hypervisor extension in Volume II.

The NTL instructions can affect cache-management operations other than
those in the Zicbom extension. For example, NTL.PALL followed by
CBO.ZERO might indicate that the line should be allocated in L3 and
zeroed, but not allocated in L1 or L2.

When an NTL instruction is applied to a prefetch hint in the Zicbop
extension, it indicates that a cache line should be prefetched into a
cache that is outer from the level specified by the NTL.

For example, in a system with a private L1 and shared L2, NTL.P1
followed by PREFETCH.R might prefetch into L2 with read intent.

To prefetch into the innermost level of cache, do not prefix the
prefetch instruction with an NTL instruction.

In some systems, NTL.ALL followed by a prefetch instruction might
prefetch into a cache or prefetch buffer internal to a memory
controller.

Software is discouraged from following an NTL instruction with an
instruction that does not explicitly access memory. Nonadherence to this
recommendation might reduce performance but otherwise has no
architecturally visible effect.

In the event that a trap is taken on the target instruction,
implementations are discouraged from applying the NTL to the first
instruction in the trap handler. Instead, implementations are
recommended to ignore the HINT in this case.

If an interrupt occurs between the execution of an NTL instruction and
its target instruction, execution will normally resume at the target
instruction. That the NTL instruction is not reexecuted does not change
the semantics of the program.

Some implementations might prefer not to process the NTL instruction
until the target instruction is seen (e.g., so that the NTL can be fused
with the memory access it modifies). Such implementations might
preferentially take the interrupt before the NTL, rather than between
the NTL and the memory access.


Since the NTL instructions are encoded as ADDs, they can be used within
LR/SC loops without voiding the forward-progress guarantee. But, since
using other loads and stores within an LR/SC loop does void the
forward-progress guarantee, the only reason to use an NTL within such a
loop is to modify the LR or the SC.

# “Zihintpause” Pause Hint, Version 2.0
The PAUSE instruction is a HINT that indicates the current hart’s rate
of instruction retirement should be temporarily reduced or paused. The
duration of its effect must be bounded and may be zero.

Software can use the PAUSE instruction to reduce energy consumption
while executing spin-wait code sequences. Multithreaded cores might
temporarily relinquish execution resources to other harts when PAUSE is
executed. It is recommended that a PAUSE instruction generally be
included in the code sequence for a spin-wait loop.

A future extension might add primitives similar to the x86 MONITOR/MWAIT
instructions, which provide a more efficient mechanism to wait on writes
to a specific memory location. However, these instructions would not
supplant PAUSE. PAUSE is more appropriate when polling for non-memory
events, when polling for multiple events, or when software does not know
precisely what events it is polling for.

The duration of a PAUSE instruction’s effect may vary significantly
within and among implementations. In typical implementations this
duration should be much less than the time to perform a context switch,
probably more on the rough order of an on-chip cache miss latency or a
cacheless access to main memory.

A series of PAUSE instructions can be used to create a cumulative delay
loosely proportional to the number of PAUSE instructions. In spin-wait
loops in portable code, however, only one PAUSE instruction should be
used before re-evaluating loop conditions, else the hart might stall
longer than optimal on some implementations, degrading system
performance.

PAUSE is encoded as a FENCE instruction with pred=W, succ=0, fm=0,
rd=x0, and rs1=x0.
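The encoding can be sketched by assembling a FENCE word from its fields. The field positions, the MISC-MEM opcode value, and the bit assigned to W within the predecessor set are assumptions taken from the base ISA definition of FENCE, not from this chapter; `encode_fence` is a hypothetical helper.

```python
MISC_MEM = 0b0001111  # major opcode value (assumed from the standard opcode map)
FUNCT3_FENCE = 0b000  # funct3 for FENCE (assumed)

# pred/succ sets are 4-bit fields ordered I, O, R, W from most to least
# significant bit (assumed from the base ISA), so W alone is 0b0001.
SET_W = 0b0001


def encode_fence(fm, pred, succ, rs1=0, rd=0):
    """Assemble fm[31:28] | pred[27:24] | succ[23:20] | rs1 | funct3 | rd | opcode."""
    return (
        (fm << 28)
        | (pred << 24)
        | (succ << 20)
        | (rs1 << 15)
        | (FUNCT3_FENCE << 12)
        | (rd << 7)
        | MISC_MEM
    )


PAUSE = encode_fence(fm=0, pred=SET_W, succ=0)
```

Because the word is an ordinary FENCE with a null successor set, an implementation that does not recognize PAUSE still executes it without any architecturally visible effect.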

PAUSE is encoded as a hint within the FENCE opcode because some
implementations are expected to deliberately stall the PAUSE instruction
until outstanding memory transactions have completed. Because the
successor set is null, however, PAUSE does not mandate any particular
memory ordering; hence, it truly is a HINT.

Like other FENCE instructions, PAUSE cannot be used within LR/SC
sequences without voiding the forward-progress guarantee.

The choice of a predecessor set of W is arbitrary, since the successor
set is null. Other HINTs similar to PAUSE might be encoded with other
predecessor sets.

# RV32E and RV64E Base Integer Instruction Sets, Version 1.95
This chapter describes a proposal for the RV32E and RV64E base integer
instruction sets, designed for microcontrollers in embedded systems.
RV32E and RV64E are reduced versions of RV32I and RV64I, respectively:
the only change is to reduce the number of integer registers to 16. This
chapter only outlines the differences between RV32E/RV64E and
RV32I/RV64I, and so should be read after
Chapters [rv32] and
[rv64].

RV32E was designed to provide an even smaller base core for embedded
microcontrollers. There is also interest in RV64E for microcontrollers
within large SoC designs, and to reduce context state for highly
threaded 64-bit processors.

Unless otherwise stated, standard extensions compatible with RV32I and
RV64I are also compatible with RV32E and RV64E, respectively.

## RV32E and RV64E Programmers’ Model

RV32E and RV64E reduce the integer register count to 16 general-purpose
registers (x0–x15), where x0 is a dedicated zero register.

We have found that in the small RV32I core implementations, the upper 16
registers consume around one quarter of the total area of the core
excluding memories, thus their removal saves around 25% core area with a
corresponding core power reduction.

## RV32E and RV64E Instruction Set Encoding

RV32E and RV64E use the same instruction-set encoding as RV32I and RV64I
respectively, except that only registers x0–x15 are provided. All
encodings specifying the other registers x16–x31 are reserved.

The previous draft of this chapter made all encodings using the
 x16–x31 registers available as custom. This version takes a more
conservative approach, making these reserved so that they can be
allocated between custom space or new standard encodings at a later
date.

# RV64I Base Integer Instruction Set, Version 2.1
This chapter describes the RV64I base integer instruction set, which
builds upon the RV32I variant described in
Chapter [rv32]. This chapter presents only the
differences with RV32I, so should be read in conjunction with the
earlier chapter.

## Register State

RV64I widens the integer registers and supported user address space to
64 bits (XLEN=64 in Figure [gprs]).

## Integer Computational Instructions

Most integer computational instructions operate on XLEN-bit values.
Additional instruction variants are provided to manipulate 32-bit values
in RV64I, indicated by a ‘W’ suffix to the opcode. These “*W”
instructions ignore the upper 32 bits of their inputs and always produce
32-bit signed values, sign-extending them to 64 bits, i.e. bits XLEN-1
through 31 are equal.

The compiler and calling convention maintain an invariant that all
32-bit values are held in a sign-extended format in 64-bit registers.
Even 32-bit unsigned integers extend bit 31 into bits 63 through 32.
Consequently, conversion between unsigned and signed 32-bit integers is
a no-op, as is conversion from a signed 32-bit integer to a signed
64-bit integer. Existing 64-bit wide SLTU and unsigned branch compares
still operate correctly on unsigned 32-bit integers under this
invariant. Similarly, existing 64-bit wide logical operations on 32-bit
sign-extended integers preserve the sign-extension property. A few new
instructions (ADD[I]W/SUBW/SxxW) are required for addition and shifts
to ensure reasonable performance for 32-bit values.
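The invariant can be checked on a few boundary values with a small Python model. This is only a sketch: registers are modelled as unsigned 64-bit integers, and the helper names are illustrative.

```python
MASK64 = (1 << 64) - 1


def sext32(value):
    """Return the 64-bit register image of the low 32 bits of value, sign-extended."""
    value &= 0xFFFFFFFF
    return (value | 0xFFFFFFFF00000000) if (value & 0x80000000) else value


def sltu(rs1, rs2):
    """64-bit unsigned set-less-than on register images."""
    return int((rs1 & MASK64) < (rs2 & MASK64))


def and_(rs1, rs2):
    """64-bit bitwise AND on register images."""
    return rs1 & rs2 & MASK64


samples = [0, 1, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF]

# Unsigned 32-bit ordering survives sign-extension into 64-bit registers.
unsigned_compare_ok = all(
    sltu(sext32(u), sext32(v)) == int(u < v) for u in samples for v in samples
)

# A 64-bit logical operation on sign-extended inputs yields a sign-extended result.
logical_ok = all(
    and_(sext32(u), sext32(v)) == sext32(u & v) for u in samples for v in samples
)
```

Both checks hold because sign-extension maps the unsigned 32-bit range monotonically into the unsigned 64-bit range, and bitwise operations act identically on bits 63 through 31.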

## Integer Register-Immediate Instructions

| imm[11:0] | rs1 | funct3 | rd | opcode |
|:- |:- |:- |:- |:- |
| 12 | 5 | 3 | 5 | 7 |
| I-immediate[11:0] | src | ADDIW | dest | OP-IMM-32 |


ADDIW is an RV64I instruction that adds the sign-extended 12-bit
immediate to register rs1 and produces the proper sign-extension of a
32-bit result in rd. Overflows are ignored and the result is the low
32 bits of the result sign-extended to 64 bits. Note, ADDIW rd, rs1, 0
writes the sign-extension of the lower 32 bits of register rs1 into
register rd (assembler pseudoinstruction SEXT.W).
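A minimal Python model of ADDIW, assuming registers are represented as unsigned 64-bit integers and the immediate is passed as its raw 12-bit field (both are modelling choices, not part of the specification):

```python
MASK64 = (1 << 64) - 1


def sext(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def addiw(rs1, imm12):
    """Add the sign-extended 12-bit immediate, keep 32 bits, sign-extend to 64."""
    total = rs1 + sext(imm12, 12)
    return sext(total, 32) & MASK64


def sext_w(rs1):
    """SEXT.W pseudoinstruction: ADDIW rd, rs1, 0."""
    return addiw(rs1, 0)
```

For example, `addiw(0x7FFFFFFF, 1)` wraps to the most-negative 32-bit value and is then sign-extended, while `sext_w` discards whatever was held in the upper 32 bits of the source.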


| imm[11:6] | imm[5] | imm[4:0] | rs1 | funct3 | rd | opcode |
|:- |:- |:- |:- |:- |:- |:- |
| 6 | 1 | 5 | 5 | 3 | 5 | 7 |
| 000000 | shamt[5] | shamt[4:0] | src | SLLI | dest | OP-IMM |
| 000000 | shamt[5] | shamt[4:0] | src | SRLI | dest | OP-IMM |
| 010000 | shamt[5] | shamt[4:0] | src | SRAI | dest | OP-IMM |
| 000000 | 0 | shamt[4:0] | src | SLLIW | dest | OP-IMM-32 |
| 000000 | 0 | shamt[4:0] | src | SRLIW | dest | OP-IMM-32 |
| 010000 | 0 | shamt[4:0] | src | SRAIW | dest | OP-IMM-32 |


Shifts by a constant are encoded as a specialization of the I-type
format using the same instruction opcode as RV32I. The operand to be
shifted is in rs1, and the shift amount is encoded in the lower 6 bits
of the I-immediate field for RV64I. The right shift type is encoded in
bit 30. SLLI is a logical left shift (zeros are shifted into the lower
bits); SRLI is a logical right shift (zeros are shifted into the upper
bits); and SRAI is an arithmetic right shift (the original sign bit is
copied into the vacated upper bits).

SLLIW, SRLIW, and SRAIW are RV64I-only instructions that are analogously
defined but operate on 32-bit values and sign-extend their 32-bit
results to 64 bits. SLLIW, SRLIW, and SRAIW encodings with
imm[5] ≠ 0 are reserved.
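The six immediate shifts can be modelled in Python as below. This is a sketch under the assumption that registers are unsigned 64-bit integers; raising `ValueError` for out-of-range shift amounts is merely one way to model a reserved encoding and is not mandated by the text.

```python
MASK64 = (1 << 64) - 1


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def sext32(value):
    """Sign-extend a 32-bit result into a 64-bit register image."""
    return to_signed(value, 32) & MASK64


def check_shamt(shamt, limit):
    if not 0 <= shamt < limit:
        raise ValueError("shift amount not encodable (reserved)")


def slli(rs1, shamt):
    check_shamt(shamt, 64)
    return (rs1 << shamt) & MASK64


def srli(rs1, shamt):
    check_shamt(shamt, 64)
    return (rs1 & MASK64) >> shamt


def srai(rs1, shamt):
    check_shamt(shamt, 64)
    return (to_signed(rs1, 64) >> shamt) & MASK64


def slliw(rs1, shamt):
    check_shamt(shamt, 32)
    return sext32(rs1 << shamt)


def srliw(rs1, shamt):
    check_shamt(shamt, 32)
    return sext32((rs1 & 0xFFFFFFFF) >> shamt)


def sraiw(rs1, shamt):
    check_shamt(shamt, 32)
    return sext32(to_signed(rs1, 32) >> shamt)
```

Note that SRLIW with a zero shift amount still sign-extends bit 31 of its input, since every “*W” result is written back in sign-extended form.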

Previously, SLLIW, SRLIW, and SRAIW with imm[5] ≠ 0 were defined
to cause illegal instruction exceptions, whereas now they are marked as
reserved. This is a backwards-compatible change.


| imm[31:12] | rd | opcode |
|:- |:- |:- |
| 20 | 5 | 7 |
| U-immediate[31:12] | dest | LUI |
| U-immediate[31:12] | dest | AUIPC |


LUI (load upper immediate) uses the same opcode as RV32I. LUI places the
32-bit U-immediate into register rd, filling in the lowest 12 bits
with zeros. The 32-bit result is sign-extended to 64 bits.

AUIPC (add upper immediate to pc) uses the same opcode as RV32I. AUIPC
is used to build pc-relative addresses and uses the U-type format.
AUIPC forms a 32-bit offset from the U-immediate, filling in the lowest
12 bits with zeros, sign-extends the result to 64 bits, adds it to the
address of the AUIPC instruction, then places the result in register
rd.

Note that the set of address offsets that can be formed by pairing LUI
with LD, AUIPC with JALR, etc. in RV64I is
[−2³¹−2¹¹, 2³¹−2¹¹−1].
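The bounds of this interval follow from combining the extreme values of the sign-extended 20-bit upper immediate with those of the sign-extended 12-bit lower immediate. A short Python check (the function names are illustrative):

```python
def sext(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def offset_range():
    """Return (lowest, highest) offset reachable by an upper/lower immediate pair."""
    upper_min = sext(0x80000 << 12, 32)  # most-negative U-immediate: -2**31
    upper_max = sext(0x7FFFF << 12, 32)  # most-positive U-immediate: 2**31 - 2**12
    lower_min = sext(0x800, 12)          # most-negative I-immediate: -2**11
    lower_max = sext(0x7FF, 12)          # most-positive I-immediate: 2**11 - 1
    return upper_min + lower_min, upper_max + lower_max
```

The range is asymmetric around zero because the lower 12-bit immediate is itself signed, which shifts both ends down by 2¹¹.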

## Integer Register-Register Operations

| funct7 | rs2 | rs1 | funct3 | rd | opcode |
|:- |:- |:- |:- |:- |:- |
| 7 | 5 | 5 | 3 | 5 | 7 |
| 0000000 | src2 | src1 | SLL/SRL | dest | OP |
| 0100000 | src2 | src1 | SRA | dest | OP |
| 0000000 | src2 | src1 | ADDW | dest | OP-32 |
| 0000000 | src2 | src1 | SLLW/SRLW | dest | OP-32 |
| 0100000 | src2 | src1 | SUBW/SRAW | dest | OP-32 |


ADDW and SUBW are RV64I-only instructions that are defined analogously
to ADD and SUB but operate on 32-bit values and produce signed 32-bit
results. Overflows are ignored, and the low 32 bits of the result are
sign-extended to 64 bits and written to the destination register.

SLL, SRL, and SRA perform logical left, logical right, and arithmetic
right shifts on the value in register rs1 by the shift amount held in
register rs2. In RV64I, only the low 6 bits of rs2 are considered
for the shift amount.

SLLW, SRLW, and SRAW are RV64I-only instructions that are analogously
defined but operate on 32-bit values and sign-extend their 32-bit
results to 64 bits. The shift amount is given by rs2[4:0].

## Load and Store Instructions

RV64I extends the address space to 64 bits. The execution environment
will define what portions of the address space are legal to access.


| imm[11:0] | rs1 | funct3 | rd | opcode |
|:- |:- |:- |:- |:- |
| 12 | 5 | 3 | 5 | 7 |
| offset[11:0] | base | width | dest | LOAD |

| imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode |
|:- |:- |:- |:- |:- |:- |
| 7 | 5 | 5 | 3 | 5 | 7 |
| offset[11:5] | src | base | width | offset[4:0] | STORE |


The LD instruction loads a 64-bit value from memory into register rd
for RV64I.

The LW instruction loads a 32-bit value from memory and sign-extends
this to 64 bits before storing it in register rd for RV64I. The LWU
instruction, on the other hand, zero-extends the 32-bit value from
memory for RV64I. LH and LHU are defined analogously for 16-bit values,
as are LB and LBU for 8-bit values. The SD, SW, SH, and SB instructions
store 64-bit, 32-bit, 16-bit, and 8-bit values from the low bits of
register rs2 to memory respectively.
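The difference between the sign-extending and zero-extending loads can be illustrated with a small Python memory model. Little-endian byte order is an assumption here (this chapter does not state it), and `load` and `store` are hypothetical helpers.

```python
MASK64 = (1 << 64) - 1


def load(mem, addr, width, signed):
    """Read `width` bytes (little-endian, assumed) and extend to a 64-bit register image."""
    value = int.from_bytes(mem[addr:addr + width], "little", signed=signed)
    return value & MASK64


def store(mem, addr, width, rs2):
    """Write the low `width` bytes of register rs2 to memory."""
    low_bits = rs2 & ((1 << (8 * width)) - 1)
    mem[addr:addr + width] = low_bits.to_bytes(width, "little")


mem = bytearray(8)
store(mem, 0, 4, 0x1234567880000000)  # SW: only the low 32 bits reach memory
lw = load(mem, 0, 4, signed=True)     # LW sign-extends bit 31
lwu = load(mem, 0, 4, signed=False)   # LWU zero-extends
ld = load(mem, 0, 8, signed=False)    # LD reads all 64 bits
```

The same pattern applies to LH/LHU and LB/LBU with widths of 2 and 1 bytes.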
## HINT Instructions

All instructions that are microarchitectural HINTs in RV32I (see
Section [sec:rv32i-hints]) are also HINTs
in RV64I. The additional computational instructions in RV64I expand both
the standard and custom HINT encoding spaces.

Table [tab:rv64i-hints] lists all RV64I
HINT code points. 91% of the HINT space is reserved for standard HINTs.
The remainder of the HINT space is designated for custom HINTs: no
standard HINTs will ever be defined in this subspace.

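| Instruction | Constraints | Code Points | Purpose |
|:- |:- |:- |:- |
| LUI | rd=x0 | 2²⁰ | Reserved for future standard use |
| AUIPC | rd=x0 | 2²⁰ | Reserved for future standard use |
| ADDI | rd=x0, and either rs1≠x0 or imm≠0 | 2¹⁷ − 1 | Reserved for future standard use |
| ANDI | rd=x0 | 2¹⁷ | Reserved for future standard use |
| ORI | rd=x0 | 2¹⁷ | Reserved for future standard use |
| XORI | rd=x0 | 2¹⁷ | Reserved for future standard use |
| ADDIW | rd=x0 | 2¹⁷ | Reserved for future standard use |
| ADD | rd=x0, rs1≠x0 | 2¹⁰ − 32 | Reserved for future standard use |
| ADD | rd=x0, rs1=x0, rs2≠x2–x5 | 28 | Reserved for future standard use |
| ADD | rd=x0, rs1=x0, rs2=x2–x5 | 4 | (rs2=x2) NTL.P1; (rs2=x3) NTL.PALL; (rs2=x4) NTL.S1; (rs2=x5) NTL.ALL |
| SUB | rd=x0 | 2¹⁰ | Reserved for future standard use |
| AND | rd=x0 | 2¹⁰ | Reserved for future standard use |
| OR | rd=x0 | 2¹⁰ | Reserved for future standard use |
| XOR | rd=x0 | 2¹⁰ | Reserved for future standard use |
| SLL | rd=x0 | 2¹⁰ | Reserved for future standard use |
| SRL | rd=x0 | 2¹⁰ | Reserved for future standard use |
| SRA | rd=x0 | 2¹⁰ | Reserved for future standard use |
| ADDW | rd=x0 | 2¹⁰ | Reserved for future standard use |
| SUBW | rd=x0 | 2¹⁰ | Reserved for future standard use |
| SLLW | rd=x0 | 2¹⁰ | Reserved for future standard use |
| SRLW | rd=x0 | 2¹⁰ | Reserved for future standard use |
| SRAW | rd=x0 | 2¹⁰ | Reserved for future standard use |
| FENCE | rd=x0, rs1≠x0, fm=0, and either pred=0 or succ=0 | 2¹⁰ − 63 | Reserved for future standard use |
| FENCE | rd≠x0, rs1=x0, fm=0, and either pred=0 or succ=0 | 2¹⁰ − 63 | Reserved for future standard use |
| FENCE | rd=rs1=x0, fm=0, pred=0, succ≠0 | 15 | Reserved for future standard use |
| FENCE | rd=rs1=x0, fm=0, pred≠W, succ=0 | 15 | Reserved for future standard use |
| FENCE | rd=rs1=x0, fm=0, pred=W, succ=0 | 1 | PAUSE |
| SLTI | rd=x0 | 2¹⁷ | Designated for custom use |
| SLTIU | rd=x0 | 2¹⁷ | Designated for custom use |
| SLLI | rd=x0 | 2¹¹ | Designated for custom use |
| SRLI | rd=x0 | 2¹¹ | Designated for custom use |
| SRAI | rd=x0 | 2¹¹ | Designated for custom use |
| SLLIW | rd=x0 | 2¹⁰ | Designated for custom use |
| SRLIW | rd=x0 | 2¹⁰ | Designated for custom use |
| SRAIW | rd=x0 | 2¹⁰ | Designated for custom use |
| SLT | rd=x0 | 2¹⁰ | Designated for custom use |
| SLTU | rd=x0 | 2¹⁰ | Designated for custom use |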
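As a rough check of the 91% figure, the code points can be summed in Python. The ADDI and FENCE counts below are derived from the listed constraints (for example, ADDI with rd=x0 excludes only the single encoding with rs1=x0 and imm=0), and treating SLTI through SLTU as the custom subspace is an assumption of this sketch.

```python
# Code points assumed to be reserved for standard HINTs.
standard = {
    "LUI": 2**20,
    "AUIPC": 2**20,
    "ADDI": 2**17 - 1,                 # derived: every rd=x0 encoding except rs1=x0, imm=0
    "ANDI": 2**17,
    "ORI": 2**17,
    "XORI": 2**17,
    "ADDIW": 2**17,
    "ADD": (2**10 - 32) + 28 + 4,      # rs1!=x0, plus rs1=x0 split into non-NTL and NTL
    "FENCE": 2 * (2**10 - 63) + 15 + 15 + 1,  # derived from the five FENCE constraint rows
}
for op in ("SUB", "AND", "OR", "XOR", "SLL", "SRL", "SRA",
           "ADDW", "SUBW", "SLLW", "SRLW", "SRAW"):
    standard[op] = 2**10

# Code points assumed to be designated for custom HINTs.
custom = {
    "SLTI": 2**17, "SLTIU": 2**17,
    "SLLI": 2**11, "SRLI": 2**11, "SRAI": 2**11,
    "SLLIW": 2**10, "SRLIW": 2**10, "SRAIW": 2**10,
    "SLT": 2**10, "SLTU": 2**10,
}

standard_total = sum(standard.values())
custom_total = sum(custom.values())
standard_share = standard_total / (standard_total + custom_total)
```

Under these assumptions `standard_share` rounds to 91%, which agrees with the proportion stated in the text.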

# RV128I Base Integer Instruction Set, Version 1.7

“There is only one mistake that can be made in computer design that
is difficult to recover from—not having enough address bits for memory
addressing and memory management.” Bell and Strecker, ISCA-3, 1976.

This chapter describes RV128I, a variant of the RISC-V ISA supporting a
flat 128-bit address space. The variant is a straightforward
extrapolation of the existing RV32I and RV64I designs.

The primary reason to extend integer register width is to support larger
address spaces. It is not clear when a flat address space larger than 64
bits will be required. At the time of writing, the fastest supercomputer
in the world as measured by the Top500 benchmark had over 1 PB of DRAM, and
would require over 50 bits of address space if all the DRAM resided in a
single address space. Some warehouse-scale computers already contain
even larger quantities of DRAM, and new dense solid-state non-volatile
memories and fast interconnect technologies might drive a demand for
even larger memory spaces. Exascale systems research is targeting 100 PB memory
systems, which occupy 57 bits of address space. At historic rates of
growth, it is possible that greater than 64 bits of address space might
be required before 2030.

History suggests that whenever it becomes clear that more than 64 bits
of address space is needed, architects will repeat intensive debates
about alternatives to extending the address space, including
segmentation, 96-bit address spaces, and software workarounds, until,
finally, flat 128-bit address spaces will be adopted as the simplest and
best solution.

We have not frozen the RV128 spec at this time, as there might be need
to evolve the design based on actual usage of 128-bit address spaces.

RV128I builds upon RV64I in the same way RV64I builds upon RV32I, with
integer registers extended to 128 bits (i.e., XLEN=128). Most integer
computational instructions are unchanged as they are defined to operate
on XLEN bits. The RV64I “*W” integer instructions that operate on
32-bit values in the low bits of a register are retained but now sign
extend their results from bit 31 to bit 127. A new set of “*D” integer
instructions are added that operate on 64-bit values held in the low
bits of the 128-bit integer registers and sign extend their results from
bit 63 to bit 127. The “*D” instructions consume two major opcodes
(OP-IMM-64 and OP-64) in the standard 32-bit encoding.

To improve compatibility with RV64, in a reverse of how RV32 to RV64 was
handled, we might change the decoding around to rename RV64I ADD as a
64-bit ADDD, and add a 128-bit ADDQ in what was previously the OP-64
major opcode (now renamed the OP-128 major opcode).

Shifts by an immediate (SLLI/SRLI/SRAI) are now encoded using the low 7
bits of the I-immediate, and variable shifts (SLL/SRL/SRA) use the low 7
bits of the shift amount source register.

An LDU (load double unsigned) instruction is added using the existing
LOAD major opcode, along with new LQ and SQ instructions to load and
store quadword values. SQ is added to the STORE major opcode, while LQ
is added to the MISC-MEM major opcode.

The floating-point instruction set is unchanged, although the 128-bit Q
floating-point extension can now support FMV.X.Q and FMV.Q.X
instructions, together with additional FCVT instructions to and from the
T (128-bit) integer format.

# “M” Standard Extension for Integer Multiplication and Division, Version 2.0

This chapter describes the standard integer multiplication and division
instruction extension, which is named “M” and contains instructions that
multiply or divide values held in two integer registers.

We separate integer multiply and divide out from the base to simplify
low-end implementations, or for applications where integer multiply and
divide operations are either infrequent or better handled in attached
accelerators.

## Multiplication Operations

| funct7 | rs2 | rs1 | funct3 | rd | opcode |
|:- |:- |:- |:- |:- |:- |
| 7 | 5 | 5 | 3 | 5 | 7 |
| MULDIV | multiplier | multiplicand | MUL/MULH[[S]U] | dest | OP |
| MULDIV | multiplier | multiplicand | MULW | dest | OP-32 |


MUL performs an XLEN-bit×XLEN-bit multiplication of rs1 by rs2 and
places the lower XLEN bits in the destination register. MULH, MULHU, and
MULHSU perform the same multiplication but return the upper XLEN bits of
the full 2×XLEN-bit product, for signed×signed, unsigned×unsigned, and
signed rs1×unsigned rs2 multiplication, respectively. If both the high and low bits of the same
product are required, then the recommended code sequence is:
MULH[[S]U] rdh, rs1, rs2; MUL rdl, rs1, rs2 (source register
specifiers must be in same order and rdh cannot be the same as rs1
or rs2). Microarchitectures can then fuse these into a single multiply
operation instead of performing two separate multiplies.
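The four multiply variants can be modelled in Python, whose integers are arbitrary precision, so the full 2×XLEN-bit product is available directly. The sketch assumes XLEN=64 and represents registers as unsigned integers; the helper names are illustrative.

```python
XLEN = 64
MASK = (1 << XLEN) - 1


def to_signed(value):
    """Interpret an XLEN-bit register image as a two's-complement integer."""
    value &= MASK
    return value - (1 << XLEN) if value >> (XLEN - 1) else value


def mul(rs1, rs2):
    """Lower XLEN bits of the product (identical for signed and unsigned operands)."""
    return (rs1 * rs2) & MASK


def mulh(rs1, rs2):
    """Upper XLEN bits of the signed x signed product."""
    return ((to_signed(rs1) * to_signed(rs2)) >> XLEN) & MASK


def mulhu(rs1, rs2):
    """Upper XLEN bits of the unsigned x unsigned product."""
    return (((rs1 & MASK) * (rs2 & MASK)) >> XLEN) & MASK


def mulhsu(rs1, rs2):
    """Upper XLEN bits of the signed rs1 x unsigned rs2 product."""
    return ((to_signed(rs1) * (rs2 & MASK)) >> XLEN) & MASK
```

With both operands equal to the all-ones pattern, the three high-half variants give three different answers, which shows why the signedness of each operand has to be selected by the opcode.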

MULHSU is used in multi-word signed multiplication to multiply the
most-significant word of the multiplicand (which contains the sign bit)
with the less-significant words of the multiplier (which are unsigned).

MULW is an RV64 instruction that multiplies the lower 32 bits of the
source registers, placing the sign-extension of the lower 32 bits of the
result into the destination register.

In RV64, MUL can be used to obtain the upper 32 bits of the 64-bit
product, but signed arguments must be proper 32-bit signed values,
whereas unsigned arguments must have their upper 32 bits clear. If the
arguments are not known to be sign- or zero-extended, an alternative is
to shift both arguments left by 32 bits, then use MULH[[S]U].
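Both techniques for obtaining the 64-bit product of two 32-bit values, along with MULW, can be sketched in Python. Registers are modelled as unsigned 64-bit integers, and the helper names are illustrative.

```python
MASK64 = (1 << 64) - 1


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mul(rs1, rs2):
    return (rs1 * rs2) & MASK64


def mulh(rs1, rs2):
    return ((to_signed(rs1, 64) * to_signed(rs2, 64)) >> 64) & MASK64


def mulw(rs1, rs2):
    """Low 32 bits of the product, sign-extended to 64 bits."""
    return to_signed(rs1 * rs2, 32) & MASK64


def sll(rs1, shamt):
    return (rs1 << shamt) & MASK64


# Properly sign-extended 32-bit operands: -3 and 2**30.
a = to_signed(-3, 32) & MASK64
b = 0x40000000
product = mul(a, b)  # full 64-bit signed product; bits 63:32 are the upper half

# Operands with arbitrary upper bits: shift left by 32, then use MULH.
a_dirty = 0x12345678FFFFFFFD
b_dirty = 0xABCDEF0140000000
product_via_mulh = mulh(sll(a_dirty, 32), sll(b_dirty, 32))
```

The shift-then-MULH sequence works because the shifts discard the unknown upper bits, and the two factors of 2³² cancel against the 64-bit right shift implied by taking the high half.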

## Division Operations

| funct7 | rs2 | rs1 | funct3 | rd | opcode |
|:- |:- |:- |:- |:- |:- |
| 7 | 5 | 5 | 3 | 5 | 7 |
| MULDIV | divisor | dividend | DIV[U]/REM[U] | dest | OP |
| MULDIV | divisor | dividend | DIV[U]W/REM[U]W | dest | OP-32 |


DIV and DIVU perform an XLEN bits by XLEN bits signed and unsigned
integer division of rs1 by rs2, rounding towards zero. REM and REMU
provide the remainder of the corresponding division operation. For REM,
the sign of the result equals the sign of the dividend.

For both signed and unsigned division, it holds that
dividend = divisor × quotient + remainder.

If both the quotient and remainder are required from the same division,
the recommended code sequence is: DIV[U] rdq, rs1, rs2; REM[U]
rdr, rs1, rs2 (rdq cannot be the same as rs1 or rs2).
Microarchitectures can then fuse these into a single divide operation
instead of performing two separate divides.

DIVW and DIVUW are RV64 instructions that divide the lower 32 bits of
rs1 by the lower 32 bits of rs2, treating them as signed and
unsigned integers respectively, placing the 32-bit quotient in rd,
sign-extended to 64 bits. REMW and REMUW are RV64 instructions that
provide the corresponding signed and unsigned remainder operations
respectively. Both REMW and REMUW always sign-extend the 32-bit result
to 64 bits, including on a divide by zero.

The semantics for division by zero and division overflow are summarized
in Table 1.1. The quotient of division by zero
has all bits set, and the remainder of division by zero equals the
dividend. Signed division overflow occurs only when the most-negative
integer is divided by −1. The quotient of a signed division with
overflow is equal to the dividend, and the remainder is zero. Unsigned
division overflow cannot occur.


| Condition | Dividend | Divisor | DIVU[W] | REMU[W] | DIV[W] | REM[W] |
|:- |:- |:- |:- |:- |:- |:- |
| Division by zero | x | 0 | $2^{L}-1$ | x | $-1$ | x |
| Overflow (signed only) | $-2^{L-1}$ | $-1$ | – | – | $-2^{L-1}$ | 0 |

Semantics for division by zero and division overflow. L is the width of
the operation in bits: XLEN for DIV[U] and REM[U], or 32 for
DIV[U]W and REM[U]W.
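The division semantics, including the two special cases, can be modelled in Python as below. The sketch assumes a 64-bit register file, with the `bits` parameter playing the role of L (64 for DIV[U] and REM[U], 32 for the W forms); all names are illustrative.

```python
MASK64 = (1 << 64) - 1


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def to_reg(value, bits):
    """Sign-extend an L-bit result into a 64-bit register image."""
    return to_signed(value, bits) & MASK64


def divu(rs1, rs2, bits=64):
    mask = (1 << bits) - 1
    a, b = rs1 & mask, rs2 & mask
    return to_reg(mask if b == 0 else a // b, bits)


def remu(rs1, rs2, bits=64):
    mask = (1 << bits) - 1
    a, b = rs1 & mask, rs2 & mask
    return to_reg(a if b == 0 else a % b, bits)


def div(rs1, rs2, bits=64):
    a, b = to_signed(rs1, bits), to_signed(rs2, bits)
    if b == 0:
        return MASK64                      # quotient has all bits set
    if a == -(1 << (bits - 1)) and b == -1:
        return to_reg(a, bits)             # overflow: quotient equals the dividend
    q = abs(a) // abs(b)                   # magnitude, i.e. rounding towards zero
    return to_reg(-q if (a < 0) != (b < 0) else q, bits)


def rem(rs1, rs2, bits=64):
    a, b = to_signed(rs1, bits), to_signed(rs2, bits)
    if b == 0:
        return to_reg(a, bits)             # remainder equals the dividend
    if a == -(1 << (bits - 1)) and b == -1:
        return 0                           # overflow: remainder is zero
    r = abs(a) % abs(b)
    return to_reg(-r if a < 0 else r, bits)  # sign follows the dividend
```

Python's own `//` and `%` round towards negative infinity, which is why the sketch divides magnitudes and restores the signs explicitly.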


We considered raising exceptions on integer divide by zero, with these
exceptions causing a trap in most execution environments. However, this
would be the only arithmetic trap in the standard ISA (floating-point
exceptions set flags and write default values, but do not cause traps)
and would require language implementors to interact with the execution
environment’s trap handlers for this case. Further, where language
standards mandate that a divide-by-zero exception must cause an
immediate control flow change, only a single branch instruction needs to
be added to each divide operation, and this branch instruction can be
inserted after the divide and should normally be very predictably not
taken, adding little runtime overhead.

The value of all bits set is returned for both unsigned and signed
divide by zero to simplify the divider circuitry. The value of all 1s is
both the natural value to return for unsigned divide, representing the
largest unsigned number, and also the natural result for simple unsigned
divider implementations. Signed division is often implemented using an
unsigned division circuit and specifying the same overflow result
simplifies the hardware.

## Zmmul Extension, Version 1.0

The Zmmul extension implements the multiplication subset of the M
extension. It adds all of the instructions defined in
Section 1.1, namely: MUL, MULH,
MULHU, MULHSU, and (for RV64 only) MULW. The encodings are identical to
those of the corresponding M-extension instructions.

The Zmmul extension enables low-cost implementations that require
multiplication operations but not division. For many microcontroller
applications, division operations are too infrequent to justify the cost
of divider hardware. By contrast, multiplication operations are more
frequent, making the cost of multiplier hardware more justifiable.
Simple FPGA soft cores particularly benefit from eliminating division
but retaining multiplication, since many FPGAs provide hardwired
multipliers but require dividers be implemented in soft logic.

# “A” Standard Extension for Atomic Instructions, Version 2.1
The standard atomic-instruction extension, named “A”, contains
instructions that atomically read-modify-write memory to support
synchronization between multiple RISC-V harts running in the same memory
space. The two forms of atomic instruction provided are
load-reserved/store-conditional instructions and atomic fetch-and-op
memory instructions. Both types of atomic instruction support various
memory consistency orderings including unordered, acquire, release, and
sequentially consistent semantics. These instructions allow RISC-V to
support the RCsc memory consistency model.

After much debate, the language community and architecture community
appear to have finally settled on release consistency as the standard
memory consistency model and so the RISC-V atomic support is built
around this model.

Specifying Ordering of Atomic Instructions

The base RISC-V ISA has a relaxed memory model, with the FENCE
instruction used to impose additional ordering constraints. The address
space is divided by the execution environment into memory and I/O
domains, and the FENCE instruction provides options to order accesses to
one or both of these two address domains.
To provide more efficient support for release consistency, each atomic
instruction has two bits, aq and rl, used to specify additional
memory ordering constraints as viewed by other RISC-V harts. The bits
order accesses to one of the two address domains, memory or I/O,
depending on which address domain the atomic instruction is accessing.
No ordering constraint is implied to accesses to the other domain, and a
FENCE instruction should be used to order across both domains.
If both bits are clear, no additional ordering constraints are imposed
on the atomic memory operation. If only the aq bit is set, the atomic
memory operation is treated as an acquire access, i.e., no following
memory operations on this RISC-V hart can be observed to take place
before the acquire memory operation. If only the rl bit is set, the
atomic memory operation is treated as a release access, i.e., the
release memory operation cannot be observed to take place before any
earlier memory operations on this RISC-V hart. If both the aq and rl
bits are set, the atomic memory operation is sequentially consistent
and cannot be observed to happen before any earlier memory operations or
after any later memory operations in the same RISC-V hart and to the
same address domain.
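
The four aq/rl combinations can be summarized in a few lines. The Python sketch below is illustrative only: the function names are hypothetical, and the bit positions used by `ordering_of` (aq in bit 26, rl in bit 25) are an assumption taken from the standard AMO instruction format rather than from the paragraph above.

```python
def amo_ordering(aq, rl):
    """Name the ordering implied by the aq and rl bits of an atomic instruction."""
    names = {
        (0, 0): "unordered",                # no additional constraints
        (1, 0): "acquire",                  # later accesses stay after it
        (0, 1): "release",                  # earlier accesses stay before it
        (1, 1): "sequentially consistent",  # both constraints apply
    }
    return names[(int(bool(aq)), int(bool(rl)))]


def ordering_of(instruction_word):
    """Extract aq (bit 26) and rl (bit 25) from a 32-bit atomic instruction."""
    aq = (instruction_word >> 26) & 1
    rl = (instruction_word >> 25) & 1
    return amo_ordering(aq, rl)
```
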

Load-Reserved/Store-Conditional Instructions


| 31–27 | 26 | 25 | 24–20 | 19–15 | 14–12 | 11–7 | 6–0 |
|:- |:- |:- |:- |:- |:- |:- |:- |
| funct5 | aq | rl | rs2 | rs1 | funct3 | rd | opcode |
| 5 | 1 | 1 | 5 | 5 | 3 | 5 | 7 |
| LR.W/D | ordering | ordering | 0 | addr | width | dest | AMO |
| SC.W/D | ordering | ordering | src | addr | width | dest | AMO |


Complex atomic memory operations on a single memory word or doubleword
are performed with the load-reserved (LR) and store-conditional (SC)
instructions. LR.W loads a word from the address in rs1, places the
sign-extended value in rd, and registers a reservation set—a set of
bytes that subsumes the bytes in the addressed word. SC.W conditionally
writes a word in rs2 to the address in rs1: the SC.W succeeds only
if the reservation is still valid and the reservation set contains the
bytes being written. If the SC.W succeeds, the instruction writes the
word in rs2 to memory, and it writes zero to rd. If the SC.W fails,
the instruction does not write to memory, and it writes a nonzero value
to rd. Regardless of success or failure, executing an SC.W instruction
invalidates any reservation held by this hart. LR.D and SC.D act
analogously on doublewords and are only available on RV64. For RV64,
LR.W and SC.W sign-extend the value placed in rd.
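
The success and failure rules for LR.W and SC.W can be exercised with a toy model. The Python sketch below is a deliberate simplification under stated assumptions: the class `ToyMemory` and its methods are hypothetical, the reservation set is exactly the addressed word (real implementations may reserve more), and sign-extension and memory ordering are not modeled.

```python
class ToyMemory:
    """Word-addressed memory shared by harts, with one reservation per hart."""

    def __init__(self):
        self.words = {}         # address -> 32-bit value
        self.reservations = {}  # hart id -> reserved address

    def lr_w(self, hart, addr):
        """Load a word and register a reservation for this hart."""
        self.reservations[hart] = addr
        return self.words.get(addr, 0)

    def store_w(self, hart, addr, value):
        """Plain store; invalidates other harts' reservations on this word."""
        self.words[addr] = value & 0xFFFFFFFF
        for other, reserved in list(self.reservations.items()):
            if other != hart and reserved == addr:
                del self.reservations[other]

    def sc_w(self, hart, addr, value):
        """Return 0 on success, nonzero on failure (as written to rd)."""
        ok = self.reservations.get(hart) == addr
        # Regardless of outcome, an SC invalidates this hart's reservation.
        self.reservations.pop(hart, None)
        if not ok:
            return 1  # failure: memory is left untouched
        self.store_w(hart, addr, value)
        return 0
```

In this model an SC that follows an intervening store from another hart returns nonzero and leaves memory unchanged, and a second SC without a new LR always fails.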

Both compare-and-swap (CAS) and LR/SC can be used to build lock-free
data structures. After extensive discussion, we opted for LR/SC for
several reasons: 1) CAS suffers from the ABA problem, which LR/SC avoids
because it monitors all writes to the address rather than only checking
for changes in the data value; 2) CAS would also require a new integer
instruction format to support three source operands (address, compare
value, swap value) as well as a different memory system message format,
which would complicate microarchitectures; 3) Furthermore, to avoid the
ABA problem, other systems provide a double-wide CAS (DW-CAS) to allow a
counter to be tested and incremented along with a data word. This
requires reading five registers and writing two in one instruction, and
also a new larger memory system message type, further complicating
implementations; 4) LR/SC provides a more efficient implementation of
many primitives as it only requires one load as opposed to two with CAS
(one load before the CAS instruction to obtain a value for speculative
computation, then a second load as part of the CAS instruction to check
if value is unchanged before updating).
The main disadvantage of LR/SC over CAS is livelock, which we avoid,
under certain circumstances, with an architected guarantee of eventual
forward progress as described below. Another concern is whether the
influence of the current x86 architecture, with its DW-CAS, will
complicate porting of synchronization libraries and other software that
assumes DW-CAS is the basic machine primitive. A possible mitigating
factor is the recent addition of transactional memory instructions to
x86, which might cause a move away from DW-CAS.
More generally, a multi-word atomic primitive is desirable, but there is
still considerable debate about what form this should take, and
guaranteeing forward progress adds complexity to a system.

The failure code with value 1 encodes an unspecified failure. Other
failure codes are reserved at this time. Portable software should only
assume the failure code will be non-zero.

We reserve a failure code of 1 to mean “unspecified” so that simple
implementations may return this value using the existing mux required
for the SLT/SLTU instructions. More specific failure codes might be
defined in future versions or extensions to the ISA.

For LR and SC, the A extension requires that the address held in rs1
be naturally aligned to the size of the operand (i.e., eight-byte
aligned for 64-bit words and four-byte aligned for 32-bit words). If the
address is not naturally aligned, an address-misaligned exception or an
access-fault exception will be generated. The access-fault exception can
be generated for a memory access that would otherwise be able to
complete except for the misalignment, if the misaligned access should
not be emulated.
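
As a small illustration of the alignment rule, the following Python sketch checks natural alignment for the two operand sizes. The function name is hypothetical, and the sketch only reports alignment; which exception a real implementation raises is decided as described above.

```python
def is_naturally_aligned(addr, size_bytes):
    """True if addr is a multiple of the operand size (4 for .W, 8 for .D)."""
    if size_bytes not in (4, 8):
        raise ValueError("LR/SC operands are 4 or 8 bytes")
    # For a power-of-two size, the low bits of an aligned address are zero.
    return (addr & (size_bytes - 1)) == 0
```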

Emulating misaligned LR/SC sequences is impractical in most systems.
Misaligned LR/SC sequences also raise the possibility of accessing
multiple reservation sets at once, which present definitions do not
provide for.

An implementation can register an arbitrarily large reservation set on
each LR, provided the reservation set includes all bytes of the
addressed data word or doubleword. An SC can only pair with the most
recent LR in program order. An SC may succeed only if no store from
another hart to the reservation set can be observed to have occurred
between the LR and the SC, and if there is no other SC between the LR
and itself in program order. An SC may succeed only if no write from a
device other than a hart to the bytes accessed by the LR instruction can
be observed to have occurred between the LR and SC. Note this LR might
have had a different effective address and data size, but reserved the
SC’s address as part of the reservation set.

Following this model, in systems with memory translation, an SC is
allowed to succeed if the earlier LR reserved the same location using an
alias with a different virtual address, but is also allowed to fail if
the virtual address is different.


To accommodate legacy devices and buses, writes from devices other than
RISC-V harts are only required to invalidate reservations when they
overlap the bytes accessed by the LR. These writes are not required to
invalidate the reservation when they access other bytes in the
reservation set.

The SC must fail if the address is not within the reservation set of the
most recent LR in program order. The SC must fail if a store to the
reservation set from another hart can be observed to occur between the
LR and SC. The SC must fail if a write from some other device to the
bytes accessed by the LR can be observed to occur between the LR and SC.
(If such a device writes the reservation set but does not write the
bytes accessed by the LR, the SC may or may not fail.) An SC must fail
if there is another SC (to any address) between the LR and the SC in
program order. The precise statement of the atomicity requirements for
successful LR/SC sequences is defined by the Atomicity Axiom in
Section [sec:rvwmo].

The platform should provide a means to determine the size and shape of
the reservation set.
A platform specification may constrain the size and shape of the
reservation set.


A store-conditional instruction to a scratch word of memory should be
used to forcibly invalidate any existing load reservation:

- during a preemptive context switch, and
- if necessary when changing virtual to physical address mappings,
  such as when migrating pages that might contain an active
  reservation.

The invalidation of a hart’s reservation when it executes an LR or SC
implies that a hart can only hold one reservation at a time, and that an
SC can only pair with the most recent LR, and LR with the next following
SC, in program order. This is a restriction to the Atomicity Axiom in
Section [sec:rvwmo] that ensures software runs
correctly on expected common implementations that operate in this
manner.

An SC instruction can never be observed by another RISC-V hart before
the LR instruction that established the reservation. The LR/SC sequence
can be given acquire semantics by setting the aq bit on the LR
instruction. The LR/SC sequence can be given release semantics by
setting the rl bit on the SC instruction. Setting the aq bit on the
LR instruction, and setting both the aq and the rl bit on the SC
instruction makes the LR/SC sequence sequentially consistent, meaning
that it cannot be reordered with earlier or later memory operations from
the same hart.
If neither bit is set on either the LR or the SC, the LR/SC sequence can be
observed to occur before or after surrounding memory operations from the
same RISC-V hart. This can be appropriate when the LR/SC sequence is
used to implement a parallel reduction operation.
Software should not set the rl bit on an LR instruction unless the
aq bit is also set, nor should software set the aq bit on an SC
instruction unless the rl bit is also set. LR.rl and SC.aq
instructions are not guaranteed to provide any stronger ordering than
those with both bits clear, but may result in lower performance.

        # a0 holds address of memory location
        # a1 holds expected value
        # a2 holds desired value
        # a0 holds return value, 0 if successful, !0 otherwise
    cas:
        lr.w t0, (a0)        # Load original value.
        bne t0, a1, fail     # Doesn't match, so fail.
        sc.w t0, a2, (a0)    # Try to update.
        bnez t0, cas         # Retry if store-conditional failed.
        li a0, 0             # Set return to success.
        jr ra                # Return.
    fail:
        li a0, 1             # Set return to failure.
        jr ra                # Return.


LR/SC can be used to construct lock-free data structures. An example
using LR/SC to implement a compare-and-swap function is shown in
the listing above.
If inlined, compare-and-swap functionality need only take four
instructions.

Eventual Success of Store-Conditional Instructions

The standard A extension defines constrained LR/SC loops, which have
the following properties:


- The loop comprises only an LR/SC sequence and code to retry the
  sequence in the case of failure, and must comprise at most 16
  instructions placed sequentially in memory.
- An LR/SC sequence begins with an LR instruction and ends with an SC
  instruction. The dynamic code executed between the LR and SC
  instructions can only contain instructions from the base “I”
  instruction set, excluding loads, stores, backward jumps, taken
  backward branches, JALR, FENCE, and SYSTEM instructions. If the “C”
  extension is supported, then compressed forms of the aforementioned
  “I” instructions are also permitted.
- The code to retry a failing LR/SC sequence can contain backwards
  jumps and/or branches to repeat the LR/SC sequence, but otherwise
  has the same constraint as the code between the LR and SC.
- The LR and SC addresses must lie within a memory region with the
  LR/SC eventuality property. The execution environment is
  responsible for communicating which regions have this property.
- The SC must be to the same effective address and of the same data
  size as the latest LR executed by the same hart.

LR/SC sequences that do not lie within constrained LR/SC loops are
unconstrained. Unconstrained LR/SC sequences might succeed on some
attempts on some implementations, but might never succeed on other
implementations.

We restricted the length of LR/SC loops to fit within 64 contiguous
instruction bytes in the base ISA to avoid undue restrictions on
instruction cache and TLB size and associativity. Similarly, we
disallowed other loads and stores within the loops to avoid restrictions
on data-cache associativity in simple implementations that track the
reservation within a private cache. The restrictions on branches and
jumps limit the time that can be spent in the sequence. Floating-point
operations and integer multiply/divide were disallowed to simplify the
operating system’s emulation of these instructions on implementations
lacking appropriate hardware support.
Software is not forbidden from using unconstrained LR/SC sequences, but
portable software must detect the case that the sequence repeatedly
fails, then fall back to an alternate code sequence that does not rely
on an unconstrained LR/SC sequence. Implementations are permitted to
unconditionally fail any unconstrained LR/SC sequence.

If a hart H enters a constrained LR/SC loop, the execution environment
must guarantee that one of the following events eventually occurs:


- H or some other hart executes a successful SC to the reservation
  set of the LR instruction in H’s constrained LR/SC loop.
- Some other hart executes an unconditional store or AMO instruction
  to the reservation set of the LR instruction in H’s constrained
  LR/SC loop, or some other device in the system writes to that
  reservation set.
- H executes a branch or jump that exits the constrained LR/SC loop.
- H traps.


Note that these definitions permit an implementation to fail an SC
instruction occasionally for any reason, provided the aforementioned
guarantee is not violated.


As a consequence of the eventuality guarantee, if some harts in an
execution environment are executing constrained LR/SC loops, and no
other harts or devices in the execution environment execute an
unconditional store or AMO to that reservation set, then at least one
hart will eventually exit its constrained LR/SC loop. By contrast, if
other harts or devices continue to write to that reservation set, it is
not guaranteed that any hart will exit its LR/SC loop.
Loads and load-reserved instructions do not by themselves impede the
progress of other harts’ LR/SC sequences. We note this constraint
implies, among other things, that loads and load-reserved instructions
executed by other harts (possibly within the same core) cannot impede
LR/SC progress indefinitely. For example, cache evictions caused by
another hart sharing the cache cannot impede LR/SC progress
indefinitely. Typically, this implies reservations are tracked
independently of evictions from any shared cache. Similarly, cache
misses caused by speculative execution within a hart cannot impede LR/SC
progress indefinitely.
These definitions admit the possibility that SC instructions may
spuriously fail for implementation reasons, provided progress is
eventually made.


One advantage of CAS is that it guarantees that some hart eventually
makes progress, whereas an LR/SC atomic sequence could livelock
indefinitely on some systems. To avoid this concern, we added an
architectural guarantee of livelock freedom for certain LR/SC sequences.
Earlier versions of this specification imposed a stronger
starvation-freedom guarantee. However, the weaker livelock-freedom
guarantee is sufficient to implement the C11 and C++11 languages, and is
substantially easier to provide in some microarchitectural styles.

Atomic Memory Operations


| 31–27 | 26 | 25 | 24–20 | 19–15 | 14–12 | 11–7 | 6–0 |
|:- |:- |:- |:- |:- |:- |:- |:- |
| funct5 | aq | rl | rs2 | rs1 | funct3 | rd | opcode |
| 5 | 1 | 1 | 5 | 5 | 3 | 5 | 7 |
| AMOSWAP.W/D | ordering | ordering | src | addr | width | dest | AMO |
| AMOADD.W/D | ordering | ordering | src | addr | width | dest | AMO |
| AMOAND.W/D | ordering | ordering | src | addr | width | dest | AMO |
| AMOOR.W/D | ordering | ordering | src | addr | width | dest | AMO |
| AMOXOR.W/D | ordering | ordering | src | addr | width | dest | AMO |
| AMOMAX[U].W/D | ordering | ordering | src | addr | width | dest | AMO |
| AMOMIN[U].W/D | ordering | ordering | src | addr | width | dest | AMO |


The atomic memory operation (AMO) instructions perform read-modify-write
operations for multiprocessor synchronization and are encoded with an
R-type instruction format. These AMO instructions atomically load a data
value from the address in rs1, place the value into register rd,
apply a binary operator to the loaded value and the original value in
rs2, then store the result back to the original address in rs1. AMOs
can either operate on 64-bit (RV64 only) or 32-bit words in memory. For
RV64, 32-bit AMOs always sign-extend the value placed in rd, and
ignore the upper 32 bits of the original value of rs2.
For AMOs, the A extension requires that the address held in rs1 be
naturally aligned to the size of the operand (i.e., eight-byte aligned
for 64-bit words and four-byte aligned for 32-bit words). If the address
is not naturally aligned, an address-misaligned exception or an
access-fault exception will be generated. The access-fault exception can
be generated for a memory access that would otherwise be able to
complete except for the misalignment, if the misaligned access should
not be emulated. The “Zam” extension, described in
Chapter [sec:zam], relaxes this requirement and
specifies the semantics of misaligned AMOs.
The operations supported are swap, integer add, bitwise AND, bitwise OR,
bitwise XOR, and signed and unsigned integer maximum and minimum.
Without ordering constraints, these AMOs can be used to implement
parallel reduction operations, where typically the return value would be
discarded by writing to x0.
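
The read-modify-write behavior described above can be sketched as a reference model. The Python code below is illustrative only: the function `amo`, the operation names, and the dictionary used as memory are hypothetical, and atomicity and ordering are not modeled since the sketch is single-threaded. It does show the sign-extension of 32-bit results on RV64 and the fact that the upper 32 bits of rs2 are ignored.

```python
def _sext(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


_OPS = {
    "swap": lambda old, src, bits: src,
    "add": lambda old, src, bits: old + src,
    "and": lambda old, src, bits: old & src,
    "or": lambda old, src, bits: old | src,
    "xor": lambda old, src, bits: old ^ src,
    "max": lambda old, src, bits: max(_sext(old, bits), _sext(src, bits)),
    "min": lambda old, src, bits: min(_sext(old, bits), _sext(src, bits)),
    "maxu": lambda old, src, bits: max(old, src),
    "minu": lambda old, src, bits: min(old, src),
}


def amo(memory, op, addr, rs2, width=32, xlen=64):
    """Apply `op` to memory[addr] and rs2; return the value written to rd."""
    mask = (1 << width) - 1
    old = memory.get(addr, 0) & mask
    src = rs2 & mask  # bits of rs2 above `width` are ignored
    memory[addr] = _OPS[op](old, src, width) & mask
    # rd receives the original memory value, sign-extended to XLEN bits.
    return _sext(old, width) & ((1 << xlen) - 1)
```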

We provided fetch-and-op style atomic primitives as they scale to highly
parallel systems better than LR/SC or CAS. A simple microarchitecture
can implement AMOs using the LR/SC primitives, provided the
implementation can guarantee the AMO eventually completes. More complex
implementations might also implement AMOs at memory controllers, and can
optimize away fetching the original value when the destination is x0.
The set of AMOs was chosen to support the C11/C++11 atomic memory
operations efficiently, and also to support parallel reductions in
memory. Another use of AMOs is to provide atomic updates to
memory-mapped device registers (e.g., setting, clearing, or toggling
bits) in the I/O space.

To help implement multiprocessor synchronization, the AMOs optionally
provide release consistency semantics. If the aq bit is set, then no
later memory operations in this RISC-V hart can be observed to take
place before the AMO. Conversely, if the rl bit is set, then other
RISC-V harts will not observe the AMO before memory accesses preceding
the AMO in this RISC-V hart. Setting both the aq and the rl bit on
an AMO makes the sequence sequentially consistent, meaning that it
cannot be reordered with earlier or later memory operations from the
same hart.

The AMOs were designed to implement the C11 and C++11 memory models
efficiently. Although the FENCE R, RW instruction suffices to implement
the acquire operation and FENCE RW, W suffices to implement release,
both imply additional unnecessary ordering as compared to AMOs with the
corresponding aq or rl bit set.

An example code sequence for a critical section guarded by a
test-and-test-and-set spinlock is shown in the listing below. Note the
first AMO is marked aq to order the lock acquisition before the
critical section, and the second AMO is marked rl to order the
critical section before the lock relinquishment.

        li           t0, 1        # Initialize swap value.
    again:
        lw           t1, (a0)     # Check if lock is held.
        bnez         t1, again    # Retry if held.
        amoswap.w.aq t1, t0, (a0) # Attempt to acquire lock.
        bnez         t1, again    # Retry if held.
        # ...
        # Critical section.
        # ...
        amoswap.w.rl x0, x0, (a0) # Release lock by storing 0.


We recommend the use of the AMO Swap idiom shown above for both lock
acquire and release to simplify the implementation of speculative lock
elision.

The instructions in the “A” extension can also be used to provide
sequentially consistent loads and stores. A sequentially consistent load
can be implemented as an LR with both aq and rl set. A sequentially
consistent store can be implemented as an AMOSWAP that writes the old
value to x0 and has both aq and rl set.

# “Zicsr”, Control and Status Register (CSR) Instructions, Version 2.0

RISC-V defines a separate address space of 4096 Control and Status
registers associated with each hart. This chapter defines the full set
of CSR instructions that operate on these CSRs.

While CSRs are primarily used by the privileged architecture, there are
several uses in unprivileged code including for counters and timers, and
for floating-point status.
The counters and timers are no longer considered mandatory parts of the
standard base ISAs, and so the CSR instructions required to access them
have been moved out of Chapter [rv32] into this separate chapter.

CSR Instructions

All CSR instructions atomically read-modify-write a single CSR, whose
CSR specifier is encoded in the 12-bit csr field of the instruction
held in bits 31–20. The immediate forms use a 5-bit zero-extended
immediate encoded in the rs1 field.


| 31–20 | 19–15 | 14–12 | 11–7 | 6–0 |
|:- |:- |:- |:- |:- |
| csr | rs1 | funct3 | rd | opcode |
| 12 | 5 | 3 | 5 | 7 |
| source/dest | source | CSRRW | dest | SYSTEM |
| source/dest | source | CSRRS | dest | SYSTEM |
| source/dest | source | CSRRC | dest | SYSTEM |
| source/dest | uimm[4:0] | CSRRWI | dest | SYSTEM |
| source/dest | uimm[4:0] | CSRRSI | dest | SYSTEM |
| source/dest | uimm[4:0] | CSRRCI | dest | SYSTEM |
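
The field layout can be made concrete with a small encoder. The Python sketch below is illustrative only: `encode_csr` and `decode_csr` are hypothetical helpers, and the numeric funct3 values and the SYSTEM opcode value are assumptions taken from the standard RISC-V opcode listings, since this chapter's text does not give them.

```python
SYSTEM = 0b1110011  # major opcode shared by all CSR instructions (assumed value)

# funct3 selects the operation; values assumed from the standard opcode map.
FUNCT3 = {"csrrw": 1, "csrrs": 2, "csrrc": 3,
          "csrrwi": 5, "csrrsi": 6, "csrrci": 7}


def encode_csr(op, rd, csr, src):
    """Encode a CSR instruction; src is the rs1 number or the 5-bit uimm."""
    if not (0 <= csr < 4096 and 0 <= rd < 32 and 0 <= src < 32):
        raise ValueError("field out of range")
    return (csr << 20) | (src << 15) | (FUNCT3[op] << 12) | (rd << 7) | SYSTEM


def decode_csr(word):
    """Split a 32-bit CSR instruction word back into its fields."""
    return {
        "csr": (word >> 20) & 0xFFF,  # bits 31-20
        "src": (word >> 15) & 0x1F,   # bits 19-15 (rs1 or uimm)
        "funct3": (word >> 12) & 0x7,
        "rd": (word >> 7) & 0x1F,
        "opcode": word & 0x7F,
    }
```

Under these assumptions, `encode_csr("csrrs", 10, 0xC00, 0)` produces `0xC0002573`.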


The CSRRW (Atomic Read/Write CSR) instruction atomically swaps values in
the CSRs and integer registers. CSRRW reads the old value of the CSR,
zero-extends the value to XLEN bits, then writes it to integer register
rd. The initial value in rs1 is written to the CSR. If rd=x0,
then the instruction shall not read the CSR and shall not cause any of
the side effects that might occur on a CSR read.
The CSRRS (Atomic Read and Set Bits in CSR) instruction reads the value
of the CSR, zero-extends the value to XLEN bits, and writes it to
integer register rd. The initial value in integer register rs1 is
treated as a bit mask that specifies bit positions to be set in the CSR.
Any bit that is high in rs1 will cause the corresponding bit to be set
in the CSR, if that CSR bit is writable. Other bits in the CSR are not
explicitly written.
The CSRRC (Atomic Read and Clear Bits in CSR) instruction reads the
value of the CSR, zero-extends the value to XLEN bits, and writes it to
integer register rd. The initial value in integer register rs1 is
treated as a bit mask that specifies bit positions to be cleared in the
CSR. Any bit that is high in rs1 will cause the corresponding bit to
be cleared in the CSR, if that CSR bit is writable. Other bits in the
CSR are not explicitly written.
For both CSRRS and CSRRC, if rs1=x0, then the instruction will not
write to the CSR at all, and so shall not cause any of the side effects
that might otherwise occur on a CSR write, nor raise illegal instruction
exceptions on accesses to read-only CSRs. Both CSRRS and CSRRC always
read the addressed CSR and cause any read side effects regardless of
rs1 and rd fields. Note that if rs1 specifies a register holding a
zero value other than x0, the instruction will still attempt to write
the unmodified value back to the CSR and will cause any attendant side
effects. A CSRRW with rs1=x0 will attempt to write zero to the
destination CSR.
The CSRRWI, CSRRSI, and CSRRCI variants are similar to CSRRW, CSRRS, and
CSRRC respectively, except they update the CSR using an XLEN-bit value
obtained by zero-extending a 5-bit unsigned immediate (uimm[4:0])
field encoded in the rs1 field instead of a value from an integer
register. For CSRRSI and CSRRCI, if the uimm[4:0] field is zero, then
these instructions will not write to the CSR, and shall not cause any of
the side effects that might otherwise occur on a CSR write, nor raise
illegal instruction exceptions on accesses to read-only CSRs. For
CSRRWI, if rd=x0, then the instruction shall not read the CSR and
shall not cause any of the side effects that might occur on a CSR read.
Both CSRRSI and CSRRCI will always read the CSR and cause any read side
effects regardless of rd and rs1 fields.


Register operand

| Instruction | rd is x0 | rs1 is x0 | Reads CSR | Writes CSR |
|:- |:- |:- |:- |:- |
| CSRRW | Yes | – | No | Yes |
| CSRRW | No | – | Yes | Yes |
| CSRRS/CSRRC | – | Yes | Yes | No |
| CSRRS/CSRRC | – | No | Yes | Yes |

Immediate operand

| Instruction | rd is x0 | uimm=0 | Reads CSR | Writes CSR |
|:- |:- |:- |:- |:- |
| CSRRWI | Yes | – | No | Yes |
| CSRRWI | No | – | Yes | Yes |
| CSRRSI/CSRRCI | – | Yes | Yes | No |
| CSRRSI/CSRRCI | – | No | Yes | Yes |

Conditions determining whether a CSR instruction reads or writes the
specified CSR.

Table 1.1 summarizes the behavior of
the CSR instructions with respect to whether they read and/or write the
CSR.
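
The same conditions can be written as a short predicate. The Python sketch below is illustrative only; the function name `csr_access` is hypothetical, `rd` is a register number, and `src` stands for the rs1 register number (register forms) or the uimm value (immediate forms).

```python
def csr_access(op, rd, src):
    """Return (reads_csr, writes_csr) for one CSR instruction."""
    op = op.lower()
    if op in ("csrrw", "csrrwi"):
        # Always writes; the read is skipped only when rd is x0.
        return (rd != 0, True)
    if op in ("csrrs", "csrrc", "csrrsi", "csrrci"):
        # Always reads; the write is skipped when rs1 is x0 or uimm is 0.
        return (True, src != 0)
    raise ValueError(f"not a CSR instruction: {op}")
```

Note that the decision depends on the register number or immediate field, not on the run-time value held in rs1.
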
For any event or consequence that occurs due to a CSR having a
particular value, if a write to the CSR gives it that value, the
resulting event or consequence is said to be an indirect effect of the
write. Indirect effects of a CSR write are not considered by the RISC-V
ISA to be side effects of that write.

An example of side effects for CSR accesses would be if reading from a
specific CSR causes a light bulb to turn on, while writing an odd value
to the same CSR causes the light to turn off. Assume writing an even
value has no effect. In this case, both the read and write have side
effects controlling whether the bulb is lit, as this condition is not
determined solely from the CSR value. (Note that after writing an odd
value to the CSR to turn off the light, then reading to turn the light
on, writing again the same odd value causes the light to turn off again.
Hence, on the last write, it is not a change in the CSR value that turns
off the light.)
On the other hand, if a bulb is rigged to light whenever the value of a
particular CSR is odd, then turning the light on and off is not
considered a side effect of writing to the CSR but merely an indirect
effect of such writes.
More concretely, the RISC-V privileged architecture defined in Volume II
specifies that certain combinations of CSR values cause a trap to occur.
When an explicit write to a CSR creates the conditions that trigger the
trap, the trap is not considered a side effect of the write but merely
an indirect effect.
Standard CSRs do not have any side effects on reads. Standard CSRs may
have side effects on writes. Custom extensions might add CSRs for which
accesses have side effects on either reads or writes.

Some CSRs, such as the instructions-retired counter, instret, may be
modified as side effects of instruction execution. In these cases, if a
CSR access instruction reads a CSR, it reads the value prior to the
execution of the instruction. If a CSR access instruction writes such a
CSR, the write is done instead of the increment. In particular, a value
written to instret by one instruction will be the value read by the
following instruction.
The assembler pseudoinstruction to read a CSR, CSRR rd, csr, is
encoded as CSRRS rd, csr, x0. The assembler pseudoinstruction to write
a CSR, CSRW csr, rs1, is encoded as CSRRW x0, csr, rs1, while CSRWI
csr, uimm, is encoded as CSRRWI x0, csr, uimm.
Further assembler pseudoinstructions are defined to set and clear bits
in the CSR when the old value is not required: CSRS/CSRC csr, rs1;
CSRSI/CSRCI csr, uimm.
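
The pseudoinstruction mappings can be collected in one place. The Python sketch below is illustrative only: `expand_csr_pseudo` is a hypothetical helper that works on operand strings, and it returns the base instruction as a tuple in (mnemonic, rd, csr, source) order.

```python
_WRITE_ONLY = {
    "csrw": "csrrw", "csrs": "csrrs", "csrc": "csrrc",
    "csrwi": "csrrwi", "csrsi": "csrrsi", "csrci": "csrrci",
}


def expand_csr_pseudo(mnemonic, *operands):
    """Expand a CSR pseudoinstruction into (base mnemonic, rd, csr, source)."""
    m = mnemonic.lower()
    if m == "csrr":
        rd, csr = operands
        return ("csrrs", rd, csr, "x0")  # read only: rs1 = x0 suppresses the write
    if m in _WRITE_ONLY:
        csr, src = operands
        return (_WRITE_ONLY[m], "x0", csr, src)  # old value discarded: rd = x0
    raise ValueError(f"unknown CSR pseudoinstruction: {mnemonic}")
```
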

CSR Access Ordering

Each RISC-V hart normally observes its own CSR accesses, including its
implicit CSR accesses, as performed in program order. In particular,
unless specified otherwise, a CSR access is performed after the
execution of any prior instructions in program order whose behavior
modifies or is modified by the CSR state and before the execution of any
subsequent instructions in program order whose behavior modifies or is
modified by the CSR state. Furthermore, an explicit CSR read returns the
CSR state before the execution of the instruction, while an explicit CSR
write suppresses and overrides any implicit writes or modifications to
the same CSR by the same instruction.
Likewise, any side effects from an explicit CSR access are normally
observed to occur synchronously in program order. Unless specified
otherwise, the full consequences of any such side effects are observable
by the very next instruction, and no consequences may be observed
out-of-order by preceding instructions. (Note the distinction made
earlier between side effects and indirect effects of CSR writes.)
For the RVWMO memory consistency model
(Chapter [ch:memorymodel]), CSR accesses are
weakly ordered by default, so other harts or devices may observe CSR
accesses in an order different from program order. In addition, CSR
accesses are not ordered with respect to explicit memory accesses,
unless a CSR access modifies the execution behavior of the instruction
that performs the explicit memory access or unless a CSR access and an
explicit memory access are ordered by either the syntactic dependencies
defined by the memory model or the ordering requirements defined by the
Memory-Ordering PMAs section in Volume II of this manual. To enforce
ordering in all other cases, software should execute a FENCE instruction
between the relevant accesses. For the purposes of the FENCE
instruction, CSR read accesses are classified as device input (I), and
CSR write accesses are classified as device output (O).

Informally, the CSR space acts as a weakly ordered memory-mapped I/O
region, as defined by the Memory-Ordering PMAs section in Volume II of
this manual. As a result, the order of CSR accesses with respect to all
other accesses is constrained by the same mechanisms that constrain the
order of memory-mapped I/O accesses to such a region.
These CSR-ordering constraints are imposed to support ordering main
memory and memory-mapped I/O accesses with respect to CSR accesses that
are visible to, or affected by, devices or other harts. Examples include
the time, cycle, and mcycle CSRs, in addition to CSRs that reflect
pending interrupts, like mip and sip. Note that implicit reads of
such CSRs (e.g., taking an interrupt because of a change in mip) are
also ordered as device input.
Most CSRs (including, e.g., the fcsr) are not visible to other harts;
their accesses can be freely reordered in the global memory order with
respect to FENCE instructions without violating this specification.

The hardware platform may define that accesses to certain CSRs are
strongly ordered, as defined by the Memory-Ordering PMAs section in
Volume II of this manual. Accesses to strongly ordered CSRs have
stronger ordering constraints with respect to accesses to both weakly
ordered CSRs and accesses to memory-mapped I/O regions.

The rules for the reordering of CSR accesses in the global memory order
should probably be moved to
Chapter [ch:memorymodel] concerning the
RVWMO memory consistency model.

# “Zicntr” and “Zihpm” Counters
RISC-V ISAs provide a set of up to thirty-two 64-bit performance
counters and timers that are accessible via unprivileged XLEN-bit
read-only CSR registers 0xC00–0xC1F (when XLEN=32, the upper 32 bits
are accessed via CSR registers 0xC80–0xC9F). These counters are
divided between the “Zicntr” and “Zihpm” extensions.

“Zicntr” Standard Extension for Base Counters and Timers

The Zicntr standard extension comprises the first three of these
counters (CYCLE, TIME, and INSTRET), which have dedicated functions
(cycle count, real-time clock, and instructions retired, respectively).
The Zicntr extension depends on the Zicsr extension.

We recommend provision of these basic counters in implementations as
they are essential for basic performance analysis, adaptive and dynamic
optimization, and to allow an application to work with real-time
streams. Additional counters in the separate Zihpm extension can help
diagnose performance problems and these should be made accessible from
user-level application code with low overhead.
Some execution environments might prohibit access to counters, for
example, to impede timing side-channel attacks.


| 31–20 | 19–15 | 14–12 | 11–7 | 6–0 |
|:- |:- |:- |:- |:- |
| csr | rs1 | funct3 | rd | opcode |
| 12 | 5 | 3 | 5 | 7 |
| RDCYCLE[H] | 0 | CSRRS | dest | SYSTEM |
| RDTIME[H] | 0 | CSRRS | dest | SYSTEM |
| RDINSTRET[H] | 0 | CSRRS | dest | SYSTEM |


For base ISAs with XLEN≥64, CSR instructions can access the full
64-bit CSRs directly. In particular, the RDCYCLE, RDTIME, and RDINSTRET
pseudoinstructions read the full 64 bits of the cycle, time, and
instret counters.

The counter pseudoinstructions are mapped to the read-only
csrrs rd, counter, x0 canonical form, but the other read-only CSR
instruction forms (based on CSRRC/CSRRSI/CSRRCI) are also legal ways to
read these CSRs.
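As a minimal sketch of the canonical form above, the following Python helper assembles the 32-bit encoding of `csrrs rd, counter, x0`. The helper name `encode_csr_read` is hypothetical, and the numeric values for the SYSTEM opcode, the CSRRS funct3 code, and the individual counter addresses are taken from the base ISA and CSR listings rather than from this section.

```python
# Assumed constants (from the base ISA opcode map and CSR address listing).
OPCODE_SYSTEM = 0b1110011
FUNCT3_CSRRS = 0b010
CSR_CYCLE, CSR_TIME, CSR_INSTRET = 0xC00, 0xC01, 0xC02


def encode_csr_read(rd: int, csr: int) -> int:
    """Encode `csrrs rd, csr, x0` (hypothetical helper, illustration only)."""
    if not 0 <= rd < 32:
        raise ValueError("rd must be in 0..31")
    if not 0 <= csr < 4096:
        raise ValueError("csr must be a 12-bit address")
    rs1 = 0  # x0 as the source means no CSR bits are set, so the access only reads
    # Field layout: csr[31:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0]
    return (csr << 20) | (rs1 << 15) | (FUNCT3_CSRRS << 12) | (rd << 7) | OPCODE_SYSTEM


# rdcycle x2
print(hex(encode_csr_read(2, CSR_CYCLE)))
```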

For base ISAs with XLEN=32, the Zicntr extension enables the three
64-bit read-only counters to be accessed in 32-bit pieces. The RDCYCLE,
RDTIME, and RDINSTRET pseudoinstructions provide the lower 32 bits, and
the RDCYCLEH, RDTIMEH, and RDINSTRETH pseudoinstructions provide the
upper 32 bits of the respective counters.

We required the counters be 64 bits wide, even when XLEN=32, as
otherwise it is very difficult for software to determine if values have
overflowed. For a low-end implementation, the upper 32 bits of each
counter can be implemented using software counters incremented by a trap
handler triggered by overflow of the lower 32 bits. The sample code
given below shows how the full 64-bit width value can be safely read
using the individual 32-bit width pseudoinstructions.

The RDCYCLE pseudoinstruction reads the low XLEN bits of the cycle
CSR which holds a count of the number of clock cycles executed by the
processor core on which the hart is running from an arbitrary start time
in the past. RDCYCLEH is only present when XLEN=32 and reads bits 63–32
of the same cycle counter. The underlying 64-bit counter should never
overflow in practice. The rate at which the cycle counter advances will
depend on the implementation and operating environment. The execution
environment should provide a means to determine the current rate
(cycles/second) at which the cycle counter is incrementing.

RDCYCLE is intended to return the number of cycles executed by the
processor core, not the hart. Precisely defining what is a “core” is
difficult given some implementation choices (e.g., AMD Bulldozer).
Precisely defining what is a “clock cycle” is also difficult given the
range of implementations (including software emulations), but the intent
is that RDCYCLE is used for performance monitoring along with the other
performance counters. In particular, where there is one hart/core, one
would expect cycle-count/instructions-retired to measure CPI for a hart.
Cores don’t have to be exposed to software at all, and an implementor
might choose to pretend multiple harts on one physical core are running
on separate cores with one hart/core, and provide separate cycle
counters for each hart. This might make sense in a simple barrel
processor (e.g., CDC 6600 peripheral processors) where inter-hart timing
interactions are non-existent or minimal.
Where there is more than one hart/core and dynamic multithreading, it is
not generally possible to separate out cycles per hart (especially with
SMT). It might be possible to define a separate performance counter that
tried to capture the number of cycles a particular hart was running, but
this definition would have to be very fuzzy to cover all the possible
threading implementations. For example, should we only count cycles for
which any instruction was issued to execution for this hart, and/or
cycles any instruction retired, or include cycles this hart was
occupying machine resources but couldn’t execute due to stalls while
other harts went into execution? Likely, “all of the above” would be
needed to have understandable performance stats. This complexity of
defining a per-hart cycle count, and also the need in any case for a
total per-core cycle count when tuning multithreaded code led to just
standardizing the per-core cycle counter, which also happens to work
well for the common single hart/core case.
Standardizing what happens during “sleep” is not practical given that
what “sleep” means is not standardized across execution environments,
but if the entire core is paused (entirely clock-gated or powered-down
in deep sleep), then it is not executing clock cycles, and the cycle
count shouldn’t be increasing per the spec. There are many details,
e.g., whether clock cycles required to reset a processor after waking up
from a power-down event should be counted, and these are considered
execution-environment-specific details.
Even though there is no precise definition that works for all platforms,
this is still a useful facility for most platforms, and an imprecise,
common, “usually correct” standard here is better than no standard. The
intent of RDCYCLE was primarily performance monitoring/tuning, and the
specification was written with that goal in mind.

The RDTIME pseudoinstruction reads the low XLEN bits of the time CSR,
which counts wall-clock real time that has passed from an arbitrary
start time in the past. RDTIMEH is only present when XLEN=32 and reads
bits 63–32 of the same real-time counter. The underlying 64-bit counter
increments by one with each tick of the real-time clock, and, for
realistic real-time clock frequencies, should never overflow in
practice. The execution environment should provide a means of
determining the period of a counter tick (seconds/tick). The period
should be constant within a small error bound. The environment should
provide a means to determine the accuracy of the clock (i.e., the
maximum relative error between the nominal and actual real-time clock
periods).
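The relationship between tick counts, tick period, and clock accuracy can be sketched as follows. This is an illustration only: the function name and the way the period and relative error are obtained are assumptions, since the specification leaves both to the execution environment.

```python
def elapsed_seconds(t0: int, t1: int, period_s: float, rel_error: float = 0.0):
    """Return (nominal, lower bound, upper bound) elapsed seconds.

    t0 and t1 are two readings of the 64-bit time counter; period_s is the
    nominal seconds/tick and rel_error the maximum relative clock error,
    both assumed to be supplied by the execution environment.
    """
    ticks = (t1 - t0) & 0xFFFFFFFFFFFFFFFF  # modular difference of 64-bit readings
    nominal = ticks * period_s
    return nominal, nominal * (1.0 - rel_error), nominal * (1.0 + rel_error)


# 10,000 ticks of a 10 MHz clock with 100 ppm accuracy
print(elapsed_seconds(1_000, 11_000, 1e-7, 1e-4))
```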

On some simple platforms, cycle count might represent a valid
implementation of RDTIME, in which case RDTIME and RDCYCLE may return
the same result.
It is difficult to provide a strict mandate on clock period given the
wide variety of possible implementation platforms. The maximum error
bound should be set based on the requirements of the platform.

The real-time clocks of all harts must be synchronized to within one
tick of the real-time clock.

As with other architectural mandates, it suffices to appear “as if”
harts are synchronized to within one tick of the real-time clock, i.e.,
software is unable to observe that there is a greater delta between the
real-time clock values observed on two harts.

The RDINSTRET pseudoinstruction reads the low XLEN bits of the
instret CSR, which counts the number of instructions retired by this
hart from some arbitrary start point in the past. RDINSTRETH is only
present when XLEN=32 and reads bits 63–32 of the same instruction
counter. The underlying 64-bit counter should never overflow in
practice.

Instructions that cause synchronous exceptions, including ECALL and
EBREAK, are not considered to retire and hence do not increment the
instret CSR.

The following code sequence will read a valid 64-bit cycle counter value
into x3:x2, even if the counter overflows its lower half between
reading its upper and lower halves.

    again:
        rdcycleh     x3          # read upper 32 bits
        rdcycle      x2          # read lower 32 bits
        rdcycleh     x4          # re-read upper 32 bits; retry if they changed
        bne          x3, x4, again


“Zihpm” Standard Extension for Hardware Performance Counters

The Zihpm extension comprises up to 29 additional unprivileged 64-bit
hardware performance counters, hpmcounter3–hpmcounter31. When
XLEN=32, the upper 32 bits of these performance counters are accessible
via additional CSRs hpmcounter3h–hpmcounter31h. The Zihpm extension
depends on the Zicsr extension.

In some applications, it is important to be able to read multiple
counters at the same instant in time. When run under a multitasking
environment, a user thread can suffer a context switch while attempting
to read the counters. One solution is for the user thread to read the
real-time counter before and after reading the other counters to
determine if a context switch occurred in the middle of the sequence, in
which case the reads can be retried. We considered adding output latches
to allow a user thread to snapshot the counter values atomically, but
this would increase the size of the user context, especially for
implementations with a richer set of counters.
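
The retry scheme described above might be sketched as follows. The callables `read_time` and `read_counters` and the threshold `max_ticks` are hypothetical; in particular, deciding that a context switch occurred because too many real-time ticks elapsed is an assumption of this sketch, not something the specification prescribes.

```python
def snapshot(read_time, read_counters, max_ticks: int, max_tries: int = 16):
    """Read a group of counters, retrying if the reads took suspiciously long.

    read_time() returns the real-time counter; read_counters() returns the
    other counter values. Both are supplied by the caller (hypothetical API).
    """
    for _ in range(max_tries):
        t0 = read_time()
        values = read_counters()
        t1 = read_time()
        if t1 - t0 <= max_ticks:  # short window: assume no context switch occurred
            return t0, values
    raise RuntimeError("could not obtain an uninterrupted snapshot")


# Simulated time readings: the first attempt is interrupted, the second is not.
times = iter([0, 1000, 2000, 2003])
print(snapshot(lambda: next(times), lambda: (1, 2), max_ticks=10))
```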

The implemented number and width of these additional counters, and the
set of events they count, is platform-specific. Accessing an
unimplemented or ill-configured counter may cause an illegal instruction
exception or may return a constant value.
The execution environment should provide a means to determine the number
and width of the implemented counters, and an interface to configure the
events to be counted by each counter.

For execution environments implemented on RISC-V privileged platforms,
the privileged architecture manual describes privileged CSRs controlling
access by lower privileged modes to these counters, and to set the
events to be counted.
Alternative execution environments (e.g., user-level-only software
performance models) may provide alternative mechanisms to configure the
events counted by the performance counters.
It would be useful to eventually standardize event settings to count
ISA-level metrics, such as the number of floating-point instructions
executed for example, and possibly a few common microarchitectural
metrics, such as “L1 instruction cache misses”.

# “F” Standard Extension for Single-Precision Floating-Point, Version 2.2
This chapter describes the standard instruction-set extension for
single-precision floating-point, which is named “F” and adds
single-precision floating-point computational instructions compliant
with the IEEE 754-2008 arithmetic standard. The F extension depends on
the “Zicsr” extension for control and status register access.

F Register State

The F extension adds 32 floating-point registers, f0–f31, each 32
bits wide, and a floating-point control and status register fcsr,
which contains the operating mode and exception status of the
floating-point unit. This additional state is shown in
Figure [fprs]. We use the term FLEN to describe the
width of the floating-point registers in the RISC-V ISA, and FLEN=32 for
the F single-precision floating-point extension. Most floating-point
instructions operate on values in the floating-point register file.
Floating-point load and store instructions transfer floating-point
values between registers and memory. Instructions to transfer values to
and from the integer register file are also provided.


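Figure [fprs]: floating-point register state, f0–f31 (each FLEN bits wide, FLEN=32 for F), and the 32-bit fcsr.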


We considered a unified register file for both integer and
floating-point values as this simplifies software register allocation
and calling conventions, and reduces total user state. However, a split
organization increases the total number of registers accessible with a
given instruction width, simplifies provision of enough regfile ports
for wide superscalar issue, supports decoupled floating-point-unit
architectures, and simplifies use of internal floating-point encoding
techniques. Compiler support and calling conventions for split register
file architectures are well understood, and using dirty bits on
floating-point register file state can reduce context-switch overhead.

Floating-Point Control and Status Register

The floating-point control and status register, fcsr, is a RISC-V
control and status register (CSR). It is a 32-bit read/write register
that selects the dynamic rounding mode for floating-point arithmetic
operations and holds the accrued exception flags, as shown in
Figure [fcsr].


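| 31–8 | 7–5 | 4 | 3 | 2 | 1 | 0 |
|:- |:- |:- |:- |:- |:- |:- |
| Reserved | Rounding Mode (frm) | NV | DZ | OF | UF | NX |
| 24 | 3 | 1 | 1 | 1 | 1 | 1 |

Bits 4–0 (NV, DZ, OF, UF, NX) form the Accrued Exceptions field (fflags).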


The fcsr register can be read and written with the FRCSR and FSCSR
instructions, which are assembler pseudoinstructions built on the
underlying CSR access instructions. FRCSR reads fcsr by copying it
into integer register rd. FSCSR swaps the value in fcsr by copying
the original value into integer register rd, and then writing a new
value obtained from integer register rs1 into fcsr.
The fields within the fcsr can also be accessed individually through
different CSR addresses, and separate assembler pseudoinstructions are
defined for these accesses. The FRRM instruction reads the Rounding Mode
field frm and copies it into the least-significant three bits of
integer register rd, with zero in all other bits. FSRM swaps the value
in frm by copying the original value into integer register rd, and
then writing a new value obtained from the three least-significant bits
of integer register rs1 into frm. FRFLAGS and FSFLAGS are defined
analogously for the Accrued Exception Flags field fflags.
Bits 31–8 of the fcsr are reserved for other standard extensions. If
these extensions are not present, implementations shall ignore writes to
these bits and supply a zero value when read. Standard software should
preserve the contents of these bits.
Floating-point operations use either a static rounding mode encoded in
the instruction, or a dynamic rounding mode held in frm. Rounding
modes are encoded as shown in
Table 1.1.
A value of 111 in the instruction’s rm field selects the dynamic
rounding mode held in frm. The behavior of floating-point instructions
that depend on rounding mode when executed with a reserved rounding mode
is reserved, including both static reserved rounding modes (101–110)
and dynamic reserved rounding modes (101–111). Some instructions,
including widening conversions, have the rm field but are nevertheless
mathematically unaffected by the rounding mode; software should set
their rm field to RNE (000) but implementations must treat the rm
field as usual (in particular, with regard to decoding legal vs.
reserved encodings).
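
The field layout of fcsr can be illustrated with a small decoder. This is a sketch under the layout described in this section (frm in bits 7–5, fflags in bits 4–0 ordered NV, DZ, OF, UF, NX from bit 4 down to bit 0); the function name is hypothetical.

```python
# Bit position of each accrued exception flag within fflags (bits 4..0 of fcsr).
FFLAG_BITS = {"NV": 4, "DZ": 3, "OF": 2, "UF": 1, "NX": 0}


def decode_fcsr(fcsr: int):
    """Split an fcsr value into (frm, set of accrued flag mnemonics)."""
    frm = (fcsr >> 5) & 0b111   # what FRRM would return
    fflags = fcsr & 0b11111     # what FRFLAGS would return
    flags = {name for name, bit in FFLAG_BITS.items() if (fflags >> bit) & 1}
    return frm, flags


# frm = 011 (RUP), with NV and NX accrued
print(decode_fcsr(0b011_10001))
```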


| Rounding Mode | Mnemonic | Meaning |
|:- |:- |:- |
| 000 | RNE | Round to Nearest, ties to Even |
| 001 | RTZ | Round towards Zero |
| 010 | RDN | Round Down (towards −∞) |
| 011 | RUP | Round Up (towards +∞) |
| 100 | RMM | Round to Nearest, ties to Max Magnitude |
| 101 | | Reserved for future use. |
| 110 | | Reserved for future use. |
| 111 | DYN | In instruction’s rm field, selects dynamic rounding mode; in Rounding Mode register, reserved. |


Rounding mode encoding.
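
The difference between the five rounding modes is easiest to see on ties. The sketch below rounds an exact rational to an integer under each mode; rounding to an integer rather than to a binary significand is a simplifying assumption of this example, and the function name is hypothetical.

```python
import math
from fractions import Fraction


def round_to_integer(x: Fraction, mode: str) -> int:
    """Round an exact value to an integer under a RISC-V rounding-mode mnemonic."""
    if mode not in ("RNE", "RTZ", "RDN", "RUP", "RMM"):
        raise ValueError(f"unknown rounding mode {mode!r}")
    lo, hi = math.floor(x), math.ceil(x)
    if mode == "RTZ":
        return math.trunc(x)
    if mode == "RDN":
        return lo
    if mode == "RUP":
        return hi
    if lo == hi:                      # already an integer
        return lo
    frac = x - lo                     # strictly between 0 and 1
    if frac != Fraction(1, 2):        # not a tie: both nearest modes agree
        return lo if frac < Fraction(1, 2) else hi
    if mode == "RNE":                 # tie: choose the even neighbour
        return lo if lo % 2 == 0 else hi
    return hi if x > 0 else lo        # RMM tie: choose the larger magnitude


for m in ("RNE", "RTZ", "RDN", "RUP", "RMM"):
    print(m, round_to_integer(Fraction(5, 2), m), round_to_integer(Fraction(-5, 2), m))
```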


The C99 language standard effectively mandates the provision of a
dynamic rounding mode register. In typical implementations, writes to
the dynamic rounding mode CSR state will serialize the pipeline. Static
rounding modes are used to implement specialized arithmetic operations
that often have to switch frequently between different rounding modes.
The ratified version of the F spec mandated that an illegal instruction
exception was raised when an instruction was executed with a reserved
dynamic rounding mode. This has been weakened to reserved, which matches
the behavior of static rounding-mode instructions. Raising an illegal
instruction exception is still valid behavior when encountering a
reserved encoding, so implementations compatible with the ratified spec
are compatible with the weakened spec.

The accrued exception flags indicate the exception conditions that have
arisen on any floating-point arithmetic instruction since the field was
last reset by software, as shown in
Table 1.2. The base RISC-V ISA does not support
generating a trap on the setting of a floating-point exception flag.


| Flag Mnemonic | Flag Meaning |
|:- |:- |
| NV | Invalid Operation |
| DZ | Divide by Zero |
| OF | Overflow |
| UF | Underflow |
| NX | Inexact |


Accrued exception flag encoding.


As allowed by the standard, we do not support traps on floating-point
exceptions in the F extension, but instead require explicit checks of
the flags in software. We considered adding branches controlled directly
by the contents of the floating-point accrued exception flags, but
ultimately chose to omit these instructions to keep the ISA simple.

NaN Generation and Propagation

Except when otherwise stated, if the result of a floating-point
operation is NaN, it is the canonical NaN. The canonical NaN has a
positive sign and all significand bits clear except the MSB, a.k.a. the
quiet bit. For single-precision floating-point, this corresponds to the
pattern  0x7fc00000.
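
As a quick check of the pattern above, the following sketch splits 0x7fc00000 into its IEEE 754 single-precision fields. The helper names are hypothetical.

```python
import math
import struct

CANONICAL_NAN_S = 0x7FC00000


def fields_s(bits: int):
    """Return (sign, exponent, fraction) of a 32-bit single-precision pattern."""
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def bits_to_float32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a single-precision value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


sign, exponent, fraction = fields_s(CANONICAL_NAN_S)
# Positive sign, all-ones exponent, and only the MSB (quiet bit) of the fraction set.
print(sign, hex(exponent), hex(fraction), bits_to_float32(CANONICAL_NAN_S))
```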

We considered propagating NaN payloads, as is recommended by the
standard, but this decision would have increased hardware cost.
Moreover, since this feature is optional in the standard, it cannot be
used in portable code.
Implementors are free to provide a NaN payload propagation scheme as a
non-standard extension enabled by a non-standard operating mode.
However, the canonical NaN scheme described above must always be
supported and should be the default mode.


We require implementations to return the standard-mandated default
values in the case of exceptional conditions, without any further
intervention on the part of user-level software (unlike the Alpha ISA
floating-point trap barriers). We believe full hardware handling of
exceptional cases will become more common, and so wish to avoid
complicating the user-level ISA to optimize other approaches.
Implementations can always trap to machine-mode software handlers to
provide exceptional default values.

Subnormal Arithmetic

Operations on subnormal numbers are handled in accordance with the IEEE
754-2008 standard.
In the parlance of the IEEE standard, tininess is detected after
rounding.

Detecting tininess after rounding results in fewer spurious underflow
signals.

Single-Precision Load and Store Instructions

Floating-point loads and stores use the same base+offset addressing mode
as the integer base ISAs, with a base address in register rs1 and a
12-bit signed byte offset. The FLW instruction loads a single-precision
floating-point value from memory into floating-point register rd. FSW
stores a single-precision value from floating-point register rs2 to
memory.


| 31–20 | 19–15 | 14–12 | 11–7 | 6–0 |
|:- |:- |:- |:- |:- |
| imm[11:0] | rs1 | width | rd | opcode |
| 12 | 5 | 3 | 5 | 7 |
| offset[11:0] | base | W | dest | LOAD-FP |


| 31–25 | 24–20 | 19–15 | 14–12 | 11–7 | 6–0 |
|:- |:- |:- |:- |:- |:- |
| imm[11:5] | rs2 | rs1 | width | imm[4:0] | opcode |
| 7 | 5 | 5 | 3 | 5 | 7 |
| offset[11:5] | src | base | W | offset[4:0] | STORE-FP |


FLW and FSW are only guaranteed to execute atomically if the effective
address is naturally aligned.
FLW and FSW do not modify the bits being transferred; in particular, the
payloads of non-canonical NaNs are preserved.
As described in
Section [sec:rv32:ldst], the EEI defines
whether misaligned floating-point loads and stores are handled invisibly
or raise a contained or fatal trap.

Single-Precision Floating-Point Computational Instructions

Floating-point arithmetic instructions with one or two source operands
use the R-type format with the OP-FP major opcode. FADD.S and FMUL.S
perform single-precision floating-point addition and multiplication
respectively, between rs1 and rs2. FSUB.S performs the
single-precision floating-point subtraction of rs2 from rs1. FDIV.S
performs the single-precision floating-point division of rs1 by rs2.
FSQRT.S computes the square root of rs1. In each case, the result is
written to rd.
The 2-bit floating-point format field fmt is encoded as shown in
Table 1.3. It is set to S (00) for all
instructions in the F extension.


| fmt field | Mnemonic | Meaning |
|:- |:- |:- |
| 00 | S | 32-bit single-precision |
| 01 | D | 64-bit double-precision |
| 10 | H | 16-bit half-precision |
| 11 | Q | 128-bit quad-precision |


Format field encoding.


All floating-point operations that perform rounding can select the
rounding mode using the rm field with the encoding shown in
Table 1.1.
Floating-point minimum-number and maximum-number instructions FMIN.S and
FMAX.S write, respectively, the smaller or larger of rs1 and rs2 to
rd. For the purposes of these instructions only, the value −0.0 is
considered to be less than the value +0.0. If both inputs are NaNs,
the result is the canonical NaN. If only one operand is a NaN, the
result is the non-NaN operand. Signaling NaN inputs set the invalid
operation exception flag, even when the result is not NaN.
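
The value selection rules for FMIN.S and FMAX.S can be sketched in Python as below. This is a simplified model operating on Python floats (doubles): it covers the signed-zero and NaN rules from the paragraph above but does not model the invalid operation flag raised by signaling NaN inputs.

```python
import math


def fmin_s(a: float, b: float) -> float:
    """Model of the FMIN.S result value (exception flags not modelled)."""
    if math.isnan(a) and math.isnan(b):
        return math.nan                     # stands in for the canonical NaN
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    if a == b == 0.0:                       # -0.0 is treated as less than +0.0
        return a if math.copysign(1.0, a) < 0 else b
    return a if a < b else b


def fmax_s(a: float, b: float) -> float:
    """Model of the FMAX.S result value (exception flags not modelled)."""
    if math.isnan(a) and math.isnan(b):
        return math.nan
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    if a == b == 0.0:
        return a if math.copysign(1.0, a) > 0 else b
    return a if a > b else b


print(fmin_s(0.0, -0.0), fmax_s(-0.0, 0.0), fmin_s(math.nan, 3.0))
```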

Note that in version 2.2 of the F extension, the FMIN.S and FMAX.S
instructions were amended to implement the proposed IEEE 754-201x
minimumNumber and maximumNumber operations, rather than the IEEE
754-2008 minNum and maxNum operations. These operations differ in their
handling of signaling NaNs.


| 31–27 | 26–25 | 24–20 | 19–15 | 14–12 | 11–7 | 6–0 |
|:- |:- |:- |:- |:- |:- |:- |
| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FADD/FSUB | S | src2 | src1 | RM | dest | OP-FP |
| FMUL/FDIV | S | src2 | src1 | RM | dest | OP-FP |
| FSQRT | S | 0 | src | RM | dest | OP-FP |
| FMIN-MAX | S | src2 | src1 | MIN/MAX | dest | OP-FP |


Floating-point fused multiply-add instructions require a new standard
instruction format. R4-type instructions specify three source registers
(rs1, rs2, and rs3) and a destination register (rd). This format
is only used by the floating-point fused multiply-add instructions.
FMADD.S multiplies the values in rs1 and rs2, adds the value in
rs3, and writes the final result to rd. FMADD.S computes
(rs1×rs2)+rs3.
FMSUB.S multiplies the values in rs1 and rs2, subtracts the value in
rs3, and writes the final result to rd. FMSUB.S computes
(rs1×rs2)-rs3.
FNMSUB.S multiplies the values in rs1 and rs2, negates the product,
adds the value in rs3, and writes the final result to rd. FNMSUB.S
computes -(rs1×rs2)+rs3.
FNMADD.S multiplies the values in rs1 and rs2, negates the product,
subtracts the value in rs3, and writes the final result to rd.
FNMADD.S computes -(rs1×rs2)-rs3.

The FNMSUB and FNMADD instructions are counterintuitively named, owing
to the naming of the corresponding instructions in MIPS-IV. The MIPS
instructions were defined to negate the sum, rather than negating the
product as the RISC-V instructions do, so the naming scheme was more
rational at the time. The two definitions differ with respect to
signed-zero results. The RISC-V definition matches the behavior of the
x86 and ARM fused multiply-add instructions, but unfortunately the
RISC-V FNMSUB and FNMADD instruction names are swapped compared to x86
and ARM.
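
The four formulas, and the signed-zero difference between negating the product and negating the sum, can be illustrated as follows. This is only a sketch: Python arithmetic is double-precision and rounds the product before the addition, so it is not a fused operation; the inputs are chosen to be exact so that this does not matter here. The function `negated_sum_nmadd` is a hypothetical name for the MIPS-style definition described above.

```python
import math


def fmadd(a, b, c):
    return (a * b) + c


def fmsub(a, b, c):
    return (a * b) - c


def fnmsub(a, b, c):
    return -(a * b) + c


def fnmadd(a, b, c):
    return -(a * b) - c


def negated_sum_nmadd(a, b, c):
    """MIPS-style alternative: negate the sum instead of the product."""
    return -((a * b) + c)


print(fmadd(2.0, 3.0, 1.0), fmsub(2.0, 3.0, 1.0), fnmsub(2.0, 3.0, 1.0), fnmadd(2.0, 3.0, 1.0))

# With a zero product and a negative-zero addend, the two definitions give zeros of opposite sign.
a, b, c = 0.0, 1.0, -0.0
print(math.copysign(1.0, fnmadd(a, b, c)), math.copysign(1.0, negated_sum_nmadd(a, b, c)))
```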


| 31–27 | 26–25 | 24–20 | 19–15 | 14–12 | 11–7 | 6–0 |
|:- |:- |:- |:- |:- |:- |:- |
| rs3 | fmt | rs2 | rs1 | rm | rd | opcode |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| src3 | S | src2 | src1 | RM | dest | F[N]MADD/F[N]MSUB |


The fused multiply-add (FMA) instructions consume a large part of the
32-bit instruction encoding space. Some alternatives considered were to
restrict FMA to only use dynamic rounding modes, but static rounding
modes are useful in code that exploits the lack of product rounding.
Another alternative would have been to use rd to provide rs3, but this
would require additional move instructions in some common sequences. The
current design still leaves a large portion of the 32-bit encoding space
open while avoiding having FMA be non-orthogonal.

The fused multiply-add instructions must set the invalid operation
exception flag when the multiplicands are ∞ and zero, even when the
addend is a quiet NaN.

The IEEE 754-2008 standard permits, but does not require, raising the
invalid exception for the operation ∞ × 0 + qNaN.

Single-Precision Floating-Point Conversion and Move Instructions

Floating-point-to-integer and integer-to-floating-point conversion
instructions are encoded in the OP-FP major opcode space. FCVT.W.S or
FCVT.L.S converts a floating-point number in floating-point register
rs1 to a signed 32-bit or 64-bit integer, respectively, in integer
register rd. FCVT.S.W or FCVT.S.L converts a 32-bit or 64-bit signed
integer, respectively, in integer register rs1 into a floating-point
number in floating-point register rd. FCVT.WU.S, FCVT.LU.S, FCVT.S.WU,
and FCVT.S.LU variants convert to or from unsigned integer values. For
XLEN > 32, FCVT.W[U].S sign-extends the 32-bit result to the
destination register width. FCVT.L[U].S and FCVT.S.L[U] are
RV64-only instructions. If the rounded result is not representable in
the destination format, it is clipped to the nearest value and the
invalid flag is set.
Table 1.4 gives the range of valid inputs
for FCVT.int.S and the behavior for invalid inputs.


| | FCVT.W.S | FCVT.WU.S | FCVT.L.S | FCVT.LU.S |
|:- |:- |:- |:- |:- |
| Minimum valid input (after rounding) | −2³¹ | 0 | −2⁶³ | 0 |
| Maximum valid input (after rounding) | 2³¹ − 1 | 2³² − 1 | 2⁶³ − 1 | 2⁶⁴ − 1 |
| Output for out-of-range negative input | −2³¹ | 0 | −2⁶³ | 0 |
| Output for −∞ | −2³¹ | 0 | −2⁶³ | 0 |
| Output for out-of-range positive input | 2³¹ − 1 | 2³² − 1 | 2⁶³ − 1 | 2⁶⁴ − 1 |
| Output for +∞ or NaN | 2³¹ − 1 | 2³² − 1 | 2⁶³ − 1 | 2⁶⁴ − 1 |


Domains of float-to-integer conversions and behavior for invalid inputs.
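
The clipping behaviour in the table can be modelled with a short Python function. This sketch assumes the RTZ rounding mode and reports only the invalid flag; the inexact flag and the other rounding modes are not modelled, and the function name is hypothetical.

```python
import math

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
UINT32_MAX = 2**32 - 1


def fcvt_to_int(x: float, lo: int, hi: int):
    """Return (result, invalid_flag) for a float-to-integer conversion under RTZ.

    lo and hi are the minimum and maximum valid values of the destination
    format, e.g. (INT32_MIN, INT32_MAX) for FCVT.W.S.
    """
    if math.isnan(x) or x == math.inf:
        return hi, True            # +inf and NaN clip to the maximum
    if x == -math.inf:
        return lo, True
    r = math.trunc(x)              # round towards zero first, then range-check
    if r > hi:
        return hi, True
    if r < lo:
        return lo, True
    return r, False


print(fcvt_to_int(3e9, INT32_MIN, INT32_MAX))   # out-of-range positive for FCVT.W.S
print(fcvt_to_int(-0.5, 0, UINT32_MAX))         # rounds to 0, valid for FCVT.WU.S
print(fcvt_to_int(-1.0, 0, UINT32_MAX))         # out-of-range negative for FCVT.WU.S
```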


All floating-point to integer and integer to floating-point conversion
instructions round according to the rm field. A floating-point
register can be initialized to floating-point positive zero using
FCVT.S.W rd, x0, which will never set any exception flags.
All floating-point conversion instructions set the Inexact exception
flag if the rounded result differs from the operand value and the
Invalid exception flag is not set.


| 31–27 | 26–25 | 24–20 | 19–15 | 14–12 | 11–7 | 6–0 |
|:- |:- |:- |:- |:- |:- |:- |
| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FCVT.int.fmt | S | W[U]/L[U] | src | RM | dest | OP-FP |
| FCVT.fmt.int | S | W[U]/L[U] | src | RM | dest | OP-FP |


Floating-point to floating-point sign-injection instructions, FSGNJ.S,
FSGNJN.S, and FSGNJX.S, produce a result that takes all bits except the
sign bit from rs1. For FSGNJ, the result’s sign bit is rs2’s sign
bit; for FSGNJN, the result’s sign bit is the opposite of rs2’s sign
bit; and for FSGNJX, the sign bit is the XOR of the sign bits of rs1
and rs2. Sign-injection instructions do not set floating-point
exception flags, nor do they canonicalize NaNs. Note, FSGNJ.S rx, ry,
ry moves ry to rx (assembler pseudoinstruction FMV.S rx, ry);
FSGNJN.S rx, ry, ry moves the negation of ry to rx (assembler
pseudoinstruction FNEG.S rx, ry); and FSGNJX.S rx, ry, ry moves the
absolute value of ry to rx (assembler pseudoinstruction FABS.S rx,
ry).
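
Because the sign-injection instructions operate purely on bit patterns, they are straightforward to model. The sketch below works on 32-bit single-precision patterns held in Python integers; the function names mirror the mnemonics but are otherwise hypothetical.

```python
SIGN = 0x80000000
MAG = 0x7FFFFFFF


def fsgnj_s(rs1: int, rs2: int) -> int:
    """All bits except the sign from rs1, sign from rs2."""
    return (rs1 & MAG) | (rs2 & SIGN)


def fsgnjn_s(rs1: int, rs2: int) -> int:
    """All bits except the sign from rs1, opposite of rs2's sign."""
    return (rs1 & MAG) | (~rs2 & SIGN)


def fsgnjx_s(rs1: int, rs2: int) -> int:
    """Sign is the XOR of the two sign bits."""
    return rs1 ^ (rs2 & SIGN)


ONE, MINUS_ONE = 0x3F800000, 0xBF800000
print(hex(fsgnj_s(ONE, ONE)))               # FMV.S: unchanged
print(hex(fsgnjn_s(ONE, ONE)))              # FNEG.S: -1.0
print(hex(fsgnjx_s(MINUS_ONE, MINUS_ONE)))  # FABS.S: +1.0
```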


| 31–27 | 26–25 | 24–20 | 19–15 | 14–12 | 11–7 | 6–0 |
|:- |:- |:- |:- |:- |:- |:- |
| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FSGNJ | S | src2 | src1 | J[N]/JX | dest | OP-FP |


The sign-injection instructions provide floating-point MV, ABS, and NEG,
as well as supporting a few other operations, including the IEEE
copySign operation and sign manipulation in transcendental math function
libraries. Although MV, ABS, and NEG only need a single register
operand, whereas FSGNJ instructions need two, it is unlikely most
microarchitectures would add optimizations to benefit from the reduced
number of register reads for these relatively infrequent instructions.
Even in this case, a microarchitecture can simply detect when both
source registers are the same for FSGNJ instructions and only read a
single copy.

Instructions are provided to move bit patterns between the
floating-point and integer registers. FMV.X.W moves the single-precision
value in floating-point register rs1 represented in IEEE 754-2008
encoding to the lower 32 bits of integer register rd. The bits are not
modified in the transfer, and in particular, the payloads of
non-canonical NaNs are preserved. For RV64, the higher 32 bits of the
destination register are filled with copies of the floating-point
number’s sign bit.
FMV.W.X moves the single-precision value encoded in IEEE 754-2008
standard encoding from the lower 32 bits of integer register rs1 to
the floating-point register rd. The bits are not modified in the
transfer, and in particular, the payloads of non-canonical NaNs are
preserved.
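
The move semantics, including the sign-bit fill on RV64, can be sketched as follows. The functions model registers as Python integers and assume FLEN=32; the names are hypothetical.

```python
def fmv_x_w(f_bits: int, xlen: int = 64) -> int:
    """Copy a 32-bit FP pattern to an XLEN-bit integer register value."""
    low = f_bits & 0xFFFFFFFF
    if xlen == 32:
        return low
    # Upper XLEN-32 bits are copies of the floating-point sign bit (bit 31).
    upper = (((1 << (xlen - 32)) - 1) << 32) if (low >> 31) else 0
    return upper | low


def fmv_w_x(x_bits: int) -> int:
    """Copy the lower 32 bits of an integer register to an FP register (FLEN=32)."""
    return x_bits & 0xFFFFFFFF


print(hex(fmv_x_w(0xBF800000)))            # -1.0: upper bits filled with ones
print(hex(fmv_w_x(fmv_x_w(0xFFC12345))))   # NaN payload survives the round trip
```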

The FMV.W.X and FMV.X.W instructions were previously called FMV.S.X and
FMV.X.S. The use of W is more consistent with their semantics as an
instruction that moves 32 bits without interpreting them. This became
clearer after defining NaN-boxing. To avoid disturbing existing code,
both the W and S versions will be supported by tools.


| 31–27 | 26–25 | 24–20 | 19–15 | 14–12 | 11–7 | 6–0 |
|:- |:- |:- |:- |:- |:- |:- |
| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FMV.X.W | S | 0 | src | 000 | dest | OP-FP |
| FMV.W.X | S | 0 | src | 000 | dest | OP-FP |


The base floating-point ISA was defined so as to allow implementations
to employ an internal recoding of the floating-point format in registers
to simplify handling of subnormal values and possibly to reduce
functional unit latency. To this end, the F extension avoids
representing integer values in the floating-point registers by defining
conversion and comparison operations that read and write the integer
register file directly. This also removes many of the common cases where
explicit moves between integer and floating-point registers are
required, reducing instruction count and critical paths for common
mixed-format code sequences.

Single-Precision Floating-Point Compare Instructions

Floating-point compare instructions (FEQ.S, FLT.S, FLE.S) perform the
specified comparison between floating-point registers (rs1 = rs2,
rs1 < rs2, rs1 ≤ rs2), writing 1 to the integer register rd if the
condition holds, and 0 otherwise.
FLT.S and FLE.S perform what the IEEE 754-2008 standard refers to as
signaling comparisons: that is, they set the invalid operation
exception flag if either input is NaN. FEQ.S performs a quiet
comparison: it only sets the invalid operation exception flag if either
input is a signaling NaN. For all three instructions, the result is 0 if
either operand is NaN.
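
The quiet versus signaling distinction can be made concrete with a small model that returns both the comparison result and whether the invalid operation flag would be set. The functions operate on 32-bit patterns so that signaling NaNs can be told apart from quiet NaNs; all names are hypothetical.

```python
import struct


def _to_float(bits: int) -> float:
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def _is_nan(bits: int) -> bool:
    return (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0


def _is_snan(bits: int) -> bool:
    return _is_nan(bits) and not (bits & 0x00400000)  # quiet bit clear


def feq_s(a: int, b: int):
    """Quiet comparison: NV only for signaling NaN inputs. Returns (rd, nv)."""
    nv = _is_snan(a) or _is_snan(b)
    if _is_nan(a) or _is_nan(b):
        return 0, nv
    return int(_to_float(a) == _to_float(b)), False


def flt_s(a: int, b: int):
    """Signaling comparison: NV for any NaN input. Returns (rd, nv)."""
    if _is_nan(a) or _is_nan(b):
        return 0, True
    return int(_to_float(a) < _to_float(b)), False


def fle_s(a: int, b: int):
    """Signaling comparison: NV for any NaN input. Returns (rd, nv)."""
    if _is_nan(a) or _is_nan(b):
        return 0, True
    return int(_to_float(a) <= _to_float(b)), False


QNAN, SNAN, ONE = 0x7FC00000, 0x7F800001, 0x3F800000
print(feq_s(QNAN, ONE), feq_s(SNAN, ONE), flt_s(QNAN, ONE), fle_s(ONE, ONE))
```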


| 31–27 | 26–25 | 24–20 | 19–15 | 14–12 | 11–7 | 6–0 |
|:- |:- |:- |:- |:- |:- |:- |
| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FCMP | S | src2 | src1 | EQ/LT/LE | dest | OP-FP |


The F extension provides a ≤ comparison, whereas the base ISAs provide a
≥ branch comparison. Because ≤ can be synthesized from ≥ and vice-versa,
there is no performance implication to this inconsistency, but it is
nevertheless an unfortunate incongruity in the ISA.

Single-Precision Floating-Point Classify Instruction

The FCLASS.S instruction examines the value in floating-point register
rs1 and writes to integer register rd a 10-bit mask that indicates
the class of the floating-point number. The format of the mask is
described in Table 1.5. The corresponding bit in rd will
be set if the property is true and clear otherwise. All other bits in
rd are cleared. Note that exactly one bit in rd will be set.
FCLASS.S does not set the floating-point exception flags.


| 31–27 | 26–25 | 24–20 | 19–15 | 14–12 | 11–7 | 6–0 |
|:- |:- |:- |:- |:- |:- |:- |
| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FCLASS | S | 0 | src | 001 | dest | OP-FP |


| rd bit | Meaning |
|:- |:- |
| 0 | rs1 is −∞. |
| 1 | rs1 is a negative normal number. |
| 2 | rs1 is a negative subnormal number. |
| 3 | rs1 is −0. |
| 4 | rs1 is +0. |
| 5 | rs1 is a positive subnormal number. |
| 6 | rs1 is a positive normal number. |
| 7 | rs1 is +∞. |
| 8 | rs1 is a signaling NaN. |
| 9 | rs1 is a quiet NaN. |


Format of result of FCLASS instruction.
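
The classification can be sketched directly from the bit fields of a single-precision value. The function name `fclass_s` is hypothetical, and the sketch assumes the standard IEEE 754 binary32 layout (1 sign bit, 8 exponent bits, 23 fraction bits).

```python
def fclass_s(bits):
    """Return the 10-bit FCLASS.S mask for a 32-bit single-precision pattern."""
    sign = (bits >> 31) & 1
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if exp == 0xFF:
        if frac == 0:
            bit = 0 if sign else 7       # -inf / +inf
        elif frac & 0x400000:
            bit = 9                      # quiet NaN (top fraction bit set)
        else:
            bit = 8                      # signaling NaN
    elif exp == 0:
        if frac == 0:
            bit = 3 if sign else 4       # -0 / +0
        else:
            bit = 2 if sign else 5       # negative / positive subnormal
    else:
        bit = 1 if sign else 6           # negative / positive normal
    return 1 << bit                      # exactly one bit is set
```

Because exactly one bit is set, software can test for several classes at once by ANDing the result with a mask, e.g. `fclass_s(x) & 0b1100000000` to detect any NaN.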


# “D” Standard Extension for Double-Precision Floating-Point, Version 2.2
This chapter describes the standard double-precision floating-point
instruction-set extension, which is named “D” and adds double-precision
floating-point computational instructions compliant with the IEEE
754-2008 arithmetic standard. The D extension depends on the base
single-precision instruction subset F.

D Register State

The D extension widens the 32 floating-point registers, f0–f31, to
64 bits (FLEN=64 in Figure [fprs]). The f registers can now hold either
32-bit or 64-bit floating-point values as described below in
Section 1.2.

FLEN can be 32, 64, or 128 depending on which of the F, D, and Q
extensions are supported. There can be up to four different
floating-point precisions supported, including H, F, D, and Q.

NaN Boxing of Narrower Values

When multiple floating-point precisions are supported, then valid values
of narrower n-bit types, n < FLEN, are represented in the lower n
bits of an FLEN-bit NaN value, in a process termed NaN-boxing. The upper
bits of a valid NaN-boxed value must be all 1s. Valid NaN-boxed n-bit
values therefore appear as negative quiet NaNs (qNaNs) when viewed as
any wider m-bit value, n < m ≤ FLEN. Any operation that writes a
narrower result to an f register must write all 1s to the uppermost
FLEN − n bits to yield a legal NaN-boxed value.

Software might not know the current type of data stored in a
floating-point register but has to be able to save and restore the
register values, hence the result of using wider operations to transfer
narrower values has to be defined. A common case is for callee-saved
registers, but a standard convention is also desirable for features
including varargs, user-level threading libraries, virtual machine
migration, and debugging.

Floating-point n-bit transfer operations move external values held in
IEEE standard formats into and out of the f registers, and comprise
floating-point loads and stores (FLn/FSn) and floating-point move
instructions (FMV.n.X/FMV.X.n). A narrower n-bit transfer,
n < FLEN, into the f registers will create a valid NaN-boxed value.
A narrower n-bit transfer out of the floating-point registers will
transfer the lower n bits of the register, ignoring the upper
FLEN − n bits.

Apart from the transfer operations described in the previous paragraph,
all other floating-point operations on narrower n-bit operands,
n < FLEN, check if the input operands are correctly NaN-boxed, i.e.,
all upper FLEN − n bits are 1. If so, the n least-significant bits
of the input are used as the input value, otherwise the input value is
treated as an n-bit canonical NaN.
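
The three rules above (boxing on a narrower write, truncation on a narrower transfer out, and the box check on a narrower operand read) can be sketched with plain integer arithmetic. The helper names are hypothetical. The canonical NaN is passed in as a parameter; `0x7FC00000` is the single-precision canonical NaN defined in the F chapter.

```python
def nan_box(value, n, flen):
    """Write an n-bit value into an FLEN-bit register image, upper bits all 1s."""
    assert n < flen
    upper_ones = ((1 << (flen - n)) - 1) << n
    return upper_ones | (value & ((1 << n) - 1))

def transfer_out(reg, n):
    """Narrower n-bit transfer out: keep the lower n bits, ignore the rest."""
    return reg & ((1 << n) - 1)

def read_operand(reg, n, flen, canonical_nan):
    """Operand read for a narrower n-bit computational instruction."""
    if n == flen:
        return reg
    if (reg >> n) == (1 << (flen - n)) - 1:   # correctly NaN-boxed
        return reg & ((1 << n) - 1)
    return canonical_nan                      # improperly boxed input
```

With FLEN=64, `nan_box(0x3F800000, 32, 64)` gives `0xFFFFFFFF3F800000`, which reads back as 1.0 for a single-precision operation, whereas a register holding `0x000000003F800000` is treated as the canonical NaN.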

Earlier versions of this document did not define the behavior of feeding
the results of narrower or wider operands into an operation, except to
require that wider saves and restores would preserve the value of a
narrower operand. The new definition removes this
implementation-specific behavior, while still accommodating both
non-recoded and recoded implementations of the floating-point unit. The
new definition also helps catch software errors by propagating NaNs if
values are used incorrectly.
Non-recoded implementations unpack and pack the operands to IEEE
standard format on the input and output of every floating-point
operation. The NaN-boxing cost to a non-recoded implementation is
primarily in checking if the upper bits of a narrower operation
represent a legal NaN-boxed value, and in writing all 1s to the upper
bits of a result.
Recoded implementations use a more convenient internal format to
represent floating-point values, with an added exponent bit to allow all
values to be held normalized. The cost to the recoded implementation is
primarily the extra tagging needed to track the internal types and sign
bits, but this can be done without adding new state bits by recoding
NaNs internally in the exponent field. Small modifications are needed to
the pipelines used to transfer values in and out of the recoded format,
but the datapath and latency costs are minimal. The recoding process has
to handle shifting of input subnormal values for wide operands in any
case, and extracting the NaN-boxed value is a similar process to
normalization except for skipping over leading-1 bits instead of
skipping over leading-0 bits, allowing the datapath muxing to be shared.

Double-Precision Load and Store Instructions

The FLD instruction loads a double-precision floating-point value from
memory into floating-point register rd. FSD stores a double-precision
value from the floating-point registers to memory.

The double-precision value may be a NaN-boxed single-precision value.


| imm[11:0] | rs1 | width | rd | opcode |
|:- |:- |:- |:- |:- |
| 12 | 5 | 3 | 5 | 7 |
| offset[11:0] | base | D | dest | LOAD-FP |


| imm[11:5] | rs2 | rs1 | width | imm[4:0] | opcode |
|:- |:- |:- |:- |:- |:- |
| 7 | 5 | 5 | 3 | 5 | 7 |
| offset[11:5] | src | base | D | offset[4:0] | STORE-FP |


FLD and FSD are only guaranteed to execute atomically if the effective
address is naturally aligned and XLEN≥64.
FLD and FSD do not modify the bits being transferred; in particular, the
payloads of non-canonical NaNs are preserved.
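
In the STORE-FP format the 12-bit signed byte offset is split across two fields, offset[11:5] and offset[4:0]. A minimal sketch of that split, with hypothetical helper names and the offset treated as a sign-extended 12-bit immediate as in the base ISA store format:

```python
def split_store_offset(offset):
    """Split a signed 12-bit byte offset into (offset[11:5], offset[4:0])."""
    if not -2048 <= offset <= 2047:
        raise ValueError("offset does not fit in 12 signed bits")
    imm = offset & 0xFFF            # two's-complement 12-bit image
    return imm >> 5, imm & 0x1F

def join_store_offset(hi, lo):
    """Inverse of split_store_offset: rebuild the signed offset."""
    imm = (hi << 5) | lo
    return imm - 0x1000 if imm & 0x800 else imm
```

For an offset of −8 the two fields are `0x7F` and `0x18`, and joining them recovers −8.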

Double-Precision Floating-Point Computational Instructions

The double-precision floating-point computational instructions are
defined analogously to their single-precision counterparts, but operate
on double-precision operands and produce double-precision results.


| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
|:- |:- |:- |:- |:- |:- |:- |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FADD/FSUB | D | src2 | src1 | RM | dest | OP-FP |
| FMUL/FDIV | D | src2 | src1 | RM | dest | OP-FP |
| FMIN-MAX | D | src2 | src1 | MIN/MAX | dest | OP-FP |
| FSQRT | D | 0 | src | RM | dest | OP-FP |


| rs3 | fmt | rs2 | rs1 | rm | rd | opcode |
|:- |:- |:- |:- |:- |:- |:- |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| src3 | D | src2 | src1 | RM | dest | F[N]MADD/F[N]MSUB |


Double-Precision Floating-Point Conversion and Move Instructions

Floating-point-to-integer and integer-to-floating-point conversion
instructions are encoded in the OP-FP major opcode space. FCVT.W.D or
FCVT.L.D converts a double-precision floating-point number in
floating-point register rs1 to a signed 32-bit or 64-bit integer,
respectively, in integer register rd. FCVT.D.W or FCVT.D.L converts a
32-bit or 64-bit signed integer, respectively, in integer register rs1
into a double-precision floating-point number in floating-point register
rd. FCVT.WU.D, FCVT.LU.D, FCVT.D.WU, and FCVT.D.LU variants convert to
or from unsigned integer values. For RV64, FCVT.W[U].D sign-extends
the 32-bit result. FCVT.L[U].D and FCVT.D.L[U] are RV64-only
instructions. The range of valid inputs for FCVT.int.D and the
behavior for invalid inputs are the same as for FCVT.int.S.
All floating-point to integer and integer to floating-point conversion
instructions round according to the rm field. Note FCVT.D.W[U]
always produces an exact result and is unaffected by rounding mode.
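
As a rough illustration of FCVT.W.D, the sketch below models only the round-towards-zero (RTZ) case on Python floats, which are binary64 values. The saturation results for NaN, infinities, and out-of-range inputs follow the FCVT.int.S table given elsewhere in this specification; exception flags are not modelled, and the function names are hypothetical.

```python
import math

INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1

def fcvt_w_d_rtz(x):
    """Model FCVT.W.D with rm = RTZ; returns the signed 32-bit result."""
    if math.isnan(x):
        return INT32_MAX                     # NaN converts to the maximum value
    if math.isinf(x):
        return INT32_MAX if x > 0 else INT32_MIN
    truncated = math.trunc(x)                # round towards zero
    return max(INT32_MIN, min(INT32_MAX, truncated))

def rv64_register_image(result32):
    """Sign-extend the 32-bit result to the 64-bit pattern written to rd on RV64."""
    return result32 & 0xFFFFFFFFFFFFFFFF
```

For instance, −3.9 converts to −3, which on RV64 is held in rd as `0xFFFFFFFFFFFFFFFD`. In the other direction, every 32-bit integer is exactly representable in double precision, which is why FCVT.D.W[U] never rounds.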


| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
|:- |:- |:- |:- |:- |:- |:- |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FCVT.int.D | D | W[U]/L[U] | src | RM | dest | OP-FP |
| FCVT.D.int | D | W[U]/L[U] | src | RM | dest | OP-FP |


The double-precision to single-precision and single-precision to
double-precision conversion instructions, FCVT.S.D and FCVT.D.S, are
encoded in the OP-FP major opcode space and both the source and
destination are floating-point registers. The rs2 field encodes the
datatype of the source, and the fmt field encodes the datatype of the
destination. FCVT.S.D rounds according to the RM field; FCVT.D.S will
never round.


| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
|:- |:- |:- |:- |:- |:- |:- |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FCVT.S.D | S | D | src | RM | dest | OP-FP |
| FCVT.D.S | D | S | src | RM | dest | OP-FP |


Floating-point to floating-point sign-injection instructions, FSGNJ.D,
FSGNJN.D, and FSGNJX.D are defined analogously to the single-precision
sign-injection instruction.


| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
|:- |:- |:- |:- |:- |:- |:- |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FSGNJ | D | src2 | src1 | J[N]/JX | dest | OP-FP |


For XLEN≥64 only, instructions are provided to move bit patterns
between the floating-point and integer registers. FMV.X.D moves the
double-precision value in floating-point register rs1 to a
representation in IEEE 754-2008 standard encoding in integer register
rd. FMV.D.X moves the double-precision value encoded in IEEE 754-2008
standard encoding from the integer register rs1 to the floating-point
register rd.
FMV.X.D and FMV.D.X do not modify the bits being transferred; in
particular, the payloads of non-canonical NaNs are preserved.


| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
|:- |:- |:- |:- |:- |:- |:- |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FMV.X.D | D | 0 | src | 000 | dest | OP-FP |
| FMV.D.X | D | 0 | src | 000 | dest | OP-FP |


Early versions of the RISC-V ISA had additional instructions to allow
RV32 systems to transfer between the upper and lower portions of a
64-bit floating-point register and an integer register. However, these
would be the only instructions with partial register writes and would
add complexity in implementations with recoded floating-point or
register renaming, requiring a pipeline read-modify-write sequence.
Scaling up to handling quad-precision for RV32 and RV64 would also
require additional instructions if they were to follow this pattern. The
ISA was defined to reduce the number of explicit int-float register
moves, by having conversions and comparisons write results to the
appropriate register file, so we expect the benefit of these
instructions to be lower than for other ISAs.
We note that for systems that implement a 64-bit floating-point unit
including fused multiply-add support and 64-bit floating-point loads and
stores, the marginal hardware cost of moving from a 32-bit to a 64-bit
integer datapath is low, and a software ABI supporting 32-bit wide
address-space and pointers can be used to avoid growth of static data
and dynamic memory traffic.

Double-Precision Floating-Point Compare Instructions

The double-precision floating-point compare instructions are defined
analogously to their single-precision counterparts, but operate on
double-precision operands.


| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
|:- |:- |:- |:- |:- |:- |:- |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FCMP | D | src2 | src1 | EQ/LT/LE | dest | OP-FP |


Double-Precision Floating-Point Classify Instruction

The double-precision floating-point classify instruction, FCLASS.D, is
defined analogously to its single-precision counterpart, but operates on
double-precision operands.


| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
|:- |:- |:- |:- |:- |:- |:- |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FCLASS | D | 0 | src | 001 | dest | OP-FP |


# “Q” Standard Extension for Quad-Precision Floating-Point, Version 2.2
This chapter describes the Q standard extension for 128-bit
quad-precision binary floating-point instructions compliant with the
IEEE 754-2008 arithmetic standard. The quad-precision binary
floating-point instruction-set extension is named “Q”; it depends on the
double-precision floating-point extension D. The floating-point
registers are now extended to hold either a single, double, or
quad-precision floating-point value (FLEN=128). The NaN-boxing scheme
described under “NaN Boxing of Narrower Values” in the D chapter is now extended recursively
to allow a single-precision value to be NaN-boxed inside a
double-precision value which is itself NaN-boxed inside a quad-precision
value.

Quad-Precision Load and Store Instructions

New 128-bit variants of LOAD-FP and STORE-FP instructions are added,
encoded with a new value for the funct3 width field.


| imm[11:0] | rs1 | width | rd | opcode |
|:- |:- |:- |:- |:- |
| 12 | 5 | 3 | 5 | 7 |
| offset[11:0] | base | Q | dest | LOAD-FP |


| imm[11:5] | rs2 | rs1 | width | imm[4:0] | opcode |
|:- |:- |:- |:- |:- |:- |
| 7 | 5 | 5 | 3 | 5 | 7 |
| offset[11:5] | src | base | Q | offset[4:0] | STORE-FP |


FLQ and FSQ are only guaranteed to execute atomically if the effective
address is naturally aligned and XLEN=128.
FLQ and FSQ do not modify the bits being transferred; in particular, the
payloads of non-canonical NaNs are preserved.

Quad-Precision Computational Instructions

A new supported format is added to the format field of most
instructions, as shown in
Table 1.1.


| fmt field | Mnemonic | Meaning |
|:- |:- |:- |
| 00 | S | 32-bit single-precision |
| 01 | D | 64-bit double-precision |
| 10 | H | 16-bit half-precision |
| 11 | Q | 128-bit quad-precision |


Format field encoding.
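
To show where the fmt field sits in an instruction word, here is a small sketch that packs and unpacks the OP-FP fields (funct5, fmt, rs2, rs1, rm, rd, opcode, of widths 5, 2, 5, 5, 3, 5 and 7 bits). The function names are hypothetical; the OP-FP major opcode value `1010011` and the FADD funct5 value `00000` are taken from the instruction listings elsewhere in the specification.

```python
FMT_NAMES = {0b00: "S", 0b01: "D", 0b10: "H", 0b11: "Q"}
OP_FP = 0b1010011  # OP-FP major opcode

def encode_op_fp(funct5, fmt, rs2, rs1, rm, rd):
    """Pack the fields of an OP-FP instruction into a 32-bit word."""
    return ((funct5 << 27) | (fmt << 25) | (rs2 << 20) | (rs1 << 15)
            | (rm << 12) | (rd << 7) | OP_FP)

def decode_op_fp(word):
    """Unpack an OP-FP instruction word into named fields."""
    return {
        "funct5": (word >> 27) & 0x1F,
        "fmt": FMT_NAMES[(word >> 25) & 0x3],
        "rs2": (word >> 20) & 0x1F,
        "rs1": (word >> 15) & 0x1F,
        "rm": (word >> 12) & 0x7,
        "rd": (word >> 7) & 0x1F,
        "opcode": word & 0x7F,
    }
```

Encoding an FADD with fmt=11 and decoding it again reports the format as Q; changing only the two fmt bits to 01 turns the same word into the double-precision variant.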


The quad-precision floating-point computational instructions are defined
analogously to their double-precision counterparts, but operate on
quad-precision operands and produce quad-precision results.


| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
|:- |:- |:- |:- |:- |:- |:- |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FADD/FSUB | Q | src2 | src1 | RM | dest | OP-FP |
| FMUL/FDIV | Q | src2 | src1 | RM | dest | OP-FP |
| FMIN-MAX | Q | src2 | src1 | MIN/MAX | dest | OP-FP |
| FSQRT | Q | 0 | src | RM | dest | OP-FP |


| rs3 | fmt | rs2 | rs1 | rm | rd | opcode |
|:- |:- |:- |:- |:- |:- |:- |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| src3 | Q | src2 | src1 | RM | dest | F[N]MADD/F[N]MSUB |


Quad-Precision Conversion and Move Instructions

New floating-point-to-integer and integer-to-floating-point conversion
instructions are added. These instructions are defined analogously to
the double-precision-to-integer and integer-to-double-precision
conversion instructions. FCVT.W.Q or FCVT.L.Q converts a quad-precision
floating-point number to a signed 32-bit or 64-bit integer,
respectively. FCVT.Q.W or FCVT.Q.L converts a 32-bit or 64-bit signed
integer, respectively, into a quad-precision floating-point number.
FCVT.WU.Q, FCVT.LU.Q, FCVT.Q.WU, and FCVT.Q.LU variants convert to or
from unsigned integer values. FCVT.L[U].Q and FCVT.Q.L[U] are
RV64-only instructions. Note FCVT.Q.L[U] always produces an exact
result and is unaffected by rounding mode.


| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
|:- |:- |:- |:- |:- |:- |:- |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FCVT.int.Q | Q | W[U]/L[U] | src | RM | dest | OP-FP |
| FCVT.Q.int | Q | W[U]/L[U] | src | RM | dest | OP-FP |


New floating-point-to-floating-point conversion instructions are added.
These instructions are defined analogously to the double-precision
floating-point-to-floating-point conversion instructions. FCVT.S.Q or
FCVT.Q.S converts a quad-precision floating-point number to a
single-precision floating-point number, or vice-versa, respectively.
FCVT.D.Q or FCVT.Q.D converts a quad-precision floating-point number to
a double-precision floating-point number, or vice-versa, respectively.


| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
|:- |:- |:- |:- |:- |:- |:- |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FCVT.S.Q | S | Q | src | RM | dest | OP-FP |
| FCVT.Q.S | Q | S | src | RM | dest | OP-FP |
| FCVT.D.Q | D | Q | src | RM | dest | OP-FP |
| FCVT.Q.D | Q | D | src | RM | dest | OP-FP |


Floating-point to floating-point sign-injection instructions, FSGNJ.Q,
FSGNJN.Q, and FSGNJX.Q are defined analogously to the double-precision
sign-injection instruction.


| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
|:- |:- |:- |:- |:- |:- |:- |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FSGNJ | Q | src2 | src1 | J[N]/JX | dest | OP-FP |


FMV.X.Q and FMV.Q.X instructions are not provided in RV32 or RV64, so
quad-precision bit patterns must be moved to the integer registers via
memory.

RV128 will support FMV.X.Q and FMV.Q.X in the Q extension.

Quad-Precision Floating-Point Compare Instructions

The quad-precision floating-point compare instructions are defined
analogously to their double-precision counterparts, but operate on
quad-precision operands.


| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
|:- |:- |:- |:- |:- |:- |:- |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FCMP | Q | src2 | src1 | EQ/LT/LE | dest | OP-FP |


Quad-Precision Floating-Point Classify Instruction

The quad-precision floating-point classify instruction, FCLASS.Q, is
defined analogously to its double-precision counterpart, but operates on
quad-precision operands.


| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
|:- |:- |:- |:- |:- |:- |:- |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FCLASS | Q | 0 | src | 001 | dest | OP-FP |


# “Zfh” and “Zfhmin” Standard Extensions for Half-Precision Floating-Point, Version 1.0
This chapter describes the Zfh standard extension for 16-bit
half-precision binary floating-point instructions compliant with the
IEEE 754-2008 arithmetic standard. The Zfh extension depends on the
single-precision floating-point extension, F. The NaN-boxing scheme
described under “NaN Boxing of Narrower Values” in the D chapter is extended to allow a
half-precision value to be NaN-boxed inside a single-precision value
(which may be recursively NaN-boxed inside a double- or quad-precision
value when the D or Q extension is present).
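
The recursive boxing can be pictured as repeatedly filling the bits above the current width with 1s until FLEN is reached. The sketch below uses a hypothetical helper name and assumes the nesting order half, single, double, quad.

```python
def nan_box_half(h_bits, flen):
    """Box a 16-bit half-precision pattern into an FLEN-bit register image.

    The value is boxed inside a single, then (if FLEN allows) inside a
    double, then inside a quad.
    """
    boxed, width = h_bits & 0xFFFF, 16
    for wider in (32, 64, 128):
        if wider > flen:
            break
        boxed |= ((1 << (wider - width)) - 1) << width  # fill upper bits with 1s
        width = wider
    return boxed
```

With FLEN=32 the half-precision value 1.0 (`0x3C00`) is held as `0xFFFF3C00`, which is a negative quiet NaN when viewed as a single-precision value; with FLEN=64 it is held as `0xFFFFFFFFFFFF3C00`.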

This extension primarily provides instructions that consume
half-precision operands and produce half-precision results. However, it
is also common to compute on half-precision data using higher
intermediate precision. Although this extension provides explicit
conversion instructions that suffice to implement that pattern, future
extensions might further accelerate such computation with additional
instructions that implicitly widen their operands—e.g.,
half×half+single→single—or implicitly narrow their results—e.g.,
half+single→half.

Half-Precision Load and Store Instructions

New 16-bit variants of LOAD-FP and STORE-FP instructions are added,
encoded with a new value for the funct3 width field.


| imm[11:0] | rs1 | width | rd | opcode |
|:- |:- |:- |:- |:- |
| 12 | 5 | 3 | 5 | 7 |
| offset[11:0] | base | H | dest | LOAD-FP |


| imm[11:5] | rs2 | rs1 | width | imm[4:0] | opcode |
|:- |:- |:- |:- |:- |:- |
| 7 | 5 | 5 | 3 | 5 | 7 |
| offset[11:5] | src | base | H | offset[4:0] | STORE-FP |


FLH and FSH are only guaranteed to execute atomically if the effective
address is naturally aligned.
FLH and FSH do not modify the bits being transferred; in particular, the
payloads of non-canonical NaNs are preserved. FLH NaN-boxes the result
written to rd, whereas FSH ignores all but the lower 16 bits in rs2.

Half-Precision Computational Instructions

A new supported format is added to the format field of most
instructions, as shown in
Table 1.1.


| fmt field | Mnemonic | Meaning |
|:- |:- |:- |
| 00 | S | 32-bit single-precision |
| 01 | D | 64-bit double-precision |
| 10 | H | 16-bit half-precision |
| 11 | Q | 128-bit quad-precision |


Format field encoding.


The half-precision floating-point computational instructions are defined
analogously to their single-precision counterparts, but operate on
half-precision operands and produce half-precision results.


| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
|:- |:- |:- |:- |:- |:- |:- |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FADD/FSUB | H | src2 | src1 | RM | dest | OP-FP |
| FMUL/FDIV | H | src2 | src1 | RM | dest | OP-FP |
| FMIN-MAX | H | src2 | src1 | MIN/MAX | dest | OP-FP |
| FSQRT | H | 0 | src | RM | dest | OP-FP |


| rs3 | fmt | rs2 | rs1 | rm | rd | opcode |
|:- |:- |:- |:- |:- |:- |:- |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| src3 | H | src2 | src1 | RM | dest | F[N]MADD/F[N]MSUB |


Half-Precision Conversion and Move Instructions

New floating-point-to-integer and integer-to-floating-point conversion
instructions are added. These instructions are defined analogously to
the single-precision-to-integer and integer-to-single-precision
conversion instructions. FCVT.W.H or FCVT.L.H converts a half-precision
floating-point number to a signed 32-bit or 64-bit integer,
respectively. FCVT.H.W or FCVT.H.L converts a 32-bit or 64-bit signed
integer, respectively, into a half-precision floating-point number.
FCVT.WU.H, FCVT.LU.H, FCVT.H.WU, and FCVT.H.LU variants convert to or
from unsigned integer values. FCVT.L[U].H and FCVT.H.L[U] are
RV64-only instructions.


| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
|:- |:- |:- |:- |:- |:- |:- |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FCVT.int.H | H | W[U]/L[U] | src | RM | dest | OP-FP |
| FCVT.H.int | H | W[U]/L[U] | src | RM | dest | OP-FP |


New floating-point-to-floating-point conversion instructions are added.
These instructions are defined analogously to the double-precision
floating-point-to-floating-point conversion instructions. FCVT.S.H or
FCVT.H.S converts a half-precision floating-point number to a
single-precision floating-point number, or vice-versa, respectively. If
the D extension is present, FCVT.D.H or FCVT.H.D converts a
half-precision floating-point number to a double-precision
floating-point number, or vice-versa, respectively. If the Q extension
is present, FCVT.Q.H or FCVT.H.Q converts a half-precision
floating-point number to a quad-precision floating-point number, or
vice-versa, respectively.


| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
|:- |:- |:- |:- |:- |:- |:- |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FCVT.S.H | S | H | src | RM | dest | OP-FP |
| FCVT.H.S | H | S | src | RM | dest | OP-FP |
| FCVT.D.H | D | H | src | RM | dest | OP-FP |
| FCVT.H.D | H | D | src | RM | dest | OP-FP |
| FCVT.Q.H | Q | H | src | RM | dest | OP-FP |
| FCVT.H.Q | H | Q | src | RM | dest | OP-FP |


Floating-point to floating-point sign-injection instructions, FSGNJ.H,
FSGNJN.H, and FSGNJX.H are defined analogously to the single-precision
sign-injection instruction.


| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
|:- |:- |:- |:- |:- |:- |:- |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FSGNJ | H | src2 | src1 | J[N]/JX | dest | OP-FP |


Instructions are provided to move bit patterns between the
floating-point and integer registers. FMV.X.H moves the half-precision
value in floating-point register rs1 to a representation in IEEE
754-2008 standard encoding in integer register rd, filling the upper
XLEN-16 bits with copies of the floating-point number’s sign bit.
FMV.H.X moves the half-precision value encoded in IEEE 754-2008 standard
encoding from the lower 16 bits of integer register rs1 to the
floating-point register rd, NaN-boxing the result.
FMV.X.H and FMV.H.X do not modify the bits being transferred; in
particular, the payloads of non-canonical NaNs are preserved.
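
The two moves differ in how they fill the bits above the 16-bit payload: FMV.X.H sign-extends into the integer register, while FMV.H.X NaN-boxes into the floating-point register. A minimal sketch with hypothetical function names, treating registers as unsigned integers of XLEN or FLEN bits:

```python
def fmv_x_h(freg, xlen):
    """Model FMV.X.H: copy the low 16 bits and replicate the sign bit upward."""
    h = freg & 0xFFFF
    if h & 0x8000:                                   # sign bit of the half value
        h |= ((1 << (xlen - 16)) - 1) << 16          # fill upper XLEN-16 bits
    return h

def fmv_h_x(xreg, flen):
    """Model FMV.H.X: take the low 16 bits of rs1 and NaN-box them."""
    upper_ones = ((1 << (flen - 16)) - 1) << 16
    return upper_ones | (xreg & 0xFFFF)
```

For example, the half-precision value −1.0 (`0xBC00`) moved to a 64-bit integer register becomes `0xFFFFFFFFFFFFBC00`, and moving `0x3C00` into a 32-bit floating-point register yields `0xFFFF3C00`.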


| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
|:- |:- |:- |:- |:- |:- |:- |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FMV.X.H | H | 0 | src | 000 | dest | OP-FP |
| FMV.H.X | H | 0 | src | 000 | dest | OP-FP |


Half-Precision Floating-Point Compare Instructions

The half-precision floating-point compare instructions are defined
analogously to their single-precision counterparts, but operate on
half-precision operands.


| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
|:- |:- |:- |:- |:- |:- |:- |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FCMP | H | src2 | src1 | EQ/LT/LE | dest | OP-FP |


Half-Precision Floating-Point Classify Instruction

The half-precision floating-point classify instruction, FCLASS.H, is
defined analogously to its single-precision counterpart, but operates on
half-precision operands.


| funct5 | fmt | rs2 | rs1 | rm | rd | opcode |
|:- |:- |:- |:- |:- |:- |:- |
| 5 | 2 | 5 | 5 | 3 | 5 | 7 |
| FCLASS | H | 0 | src | 001 | dest | OP-FP |


“Zfhmin” Standard Extension for Minimal Half-Precision Floating-Point Support

This section describes the Zfhmin standard extension, which provides
minimal support for 16-bit half-precision binary floating-point
instructions. The Zfhmin extension is a subset of the Zfh extension,
consisting only of data transfer and conversion instructions. Like Zfh,
the Zfhmin extension depends on the single-precision floating-point
extension, F. The expectation is that Zfhmin software primarily uses the
half-precision format for storage, performing most computation in higher
precision.
The Zfhmin extension includes the following instructions from the Zfh
extension: FLH, FSH, FMV.X.H, FMV.H.X, FCVT.S.H, and FCVT.H.S. If the D
extension is present, the FCVT.D.H and FCVT.H.D instructions are also
included. If the Q extension is present, the FCVT.Q.H and FCVT.H.Q
instructions are additionally included.

Zfhmin does not include the FSGNJ.H instruction, because it suffices to
instead use the FSGNJ.S instruction to move half-precision values
between floating-point registers.


Half-precision addition, subtraction, multiplication, division, and
square-root operations can be faithfully emulated by converting the
half-precision operands to single-precision, performing the operation
using single-precision arithmetic, then converting back to
half-precision. Performing half-precision fused multiply-addition using
this method incurs a 1-ulp error on some inputs for the RNE and RMM
rounding modes.
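
The widen, operate, narrow pattern for addition can be sketched on the host using Python's `struct` module, whose `e` and `f` format codes convert to and from IEEE half and single precision. This is an illustration only: it models round-to-nearest-even, ignores exception flags, and the function names are hypothetical. The comments map each step to the instruction it stands in for.

```python
import struct

def half_to_float(bits):
    # Stand-in for FCVT.S.H: widening a half value is always exact.
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def round_to_single(x):
    # Round a Python float (binary64) to the nearest single-precision value.
    return struct.unpack("<f", struct.pack("<f", x))[0]

def float_to_half(x):
    # Stand-in for FCVT.H.S under RNE; struct raises OverflowError when the
    # rounded result exceeds the half-precision range, which maps to infinity.
    try:
        return struct.unpack("<H", struct.pack("<e", x))[0]
    except OverflowError:
        return 0xFC00 if x < 0 else 0x7C00

def emulated_fadd_h(a_bits, b_bits):
    """Emulate FADD.H via single precision: widen, add, narrow."""
    a = half_to_float(a_bits)
    b = half_to_float(b_bits)
    s = round_to_single(a + b)   # stand-in for FADD.S
    return float_to_half(s)
```

As a check, 1.0 + 2.0 (`0x3C00` and `0x4000`) gives 3.0 (`0x4200`), and 1.0 + 2⁻¹¹ is a tie that rounds to even, giving 1.0 again.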
Conversion from 8- or 16-bit integers to half-precision can be emulated
by first converting to single-precision, then converting to
half-precision. Conversion from 32-bit integer can be emulated by first
converting to double-precision. If the D extension is not present and a
1-ulp error under RNE or RMM is tolerable, 32-bit integers can be first
converted to single-precision instead. The same remark applies to
conversions from 64-bit integers without the Q extension.

# RVWMO Memory Consistency Model, Version 2.0
This chapter defines the RISC-V memory consistency model. A memory
consistency model is a set of rules specifying the values that can be
returned by loads of memory. RISC-V uses a memory model called “RVWMO”
(RISC-V Weak Memory Ordering) which is designed to provide flexibility
for architects to build high-performance scalable designs while
simultaneously supporting a tractable programming model.
Under RVWMO, code running on a single hart appears to execute in order
from the perspective of other memory instructions in the same hart, but
memory instructions from another hart may observe the memory
instructions from the first hart being executed in a different order.
Therefore, multithreaded code may require explicit synchronization to
guarantee ordering between memory instructions from different harts. The
base RISC-V ISA provides a FENCE instruction for this purpose, described
in Section [sec:fence], while the atomics extension
“A” additionally defines load-reserved/store-conditional and atomic
read-modify-write instructions.
The standard ISA extension for misaligned atomics “Zam”
(Chapter [sec:zam]) and the standard ISA extension
for total store ordering “Ztso”
(Chapter [sec:ztso]) augment RVWMO with additional
rules specific to those extensions.
The appendices to this specification provide both axiomatic and
operational formalizations of the memory consistency model as well as
additional explanatory material.

This chapter defines the memory model for regular main memory
operations. The interaction of the memory model with I/O memory,
instruction fetches, FENCE.I, page table walks, and SFENCE.VMA is not
(yet) formalized. Some or all of the above may be formalized in a future
revision of this specification. The RV128 base ISA and future ISA
extensions such as the “V” vector and “J” JIT extensions will need to be
incorporated into a future revision as well.
Memory consistency models supporting overlapping memory accesses of
different widths simultaneously remain an active area of academic
research and are not yet fully understood. The specifics of how memory
accesses of different sizes interact under RVWMO are specified to the
best of our current abilities, but they are subject to revision should
new issues be uncovered.

Definition of the RVWMO Memory Model

The RVWMO memory model is defined in terms of the global memory order,
a total ordering of the memory operations produced by all harts. In
general, a multithreaded program has many different possible executions,
with each execution having its own corresponding global memory order.
The global memory order is defined over the primitive load and store
operations generated by memory instructions. It is then subject to the
constraints defined in the rest of this chapter. Any execution
satisfying all of the memory model constraints is a legal execution (as
far as the memory model is concerned).

Memory Model Primitives

The program order over memory operations reflects the order in which
the instructions that generate each load and store are logically laid
out in that hart’s dynamic instruction stream; i.e., the order in which
a simple in-order processor would execute the instructions of that hart.
Memory-accessing instructions give rise to memory operations. A memory
operation can be either a load operation, a store operation, or both
simultaneously. All memory operations are single-copy atomic: they can
never be observed in a partially complete state.
Among instructions in RV32GC and RV64GC, each aligned memory instruction
gives rise to exactly one memory operation, with two exceptions. First,
an unsuccessful SC instruction does not give rise to any memory
operations. Second, FLD and FSD instructions may each give rise to
multiple memory operations if XLEN<64, as stated in
Section [fld_fsd] and clarified below. An aligned
AMO gives rise to a single memory operation that is both a load
operation and a store operation simultaneously.

Instructions in the RV128 base instruction set and in future ISA
extensions such as V (vector) and P (SIMD) may give rise to multiple
memory operations. However, the memory model for these extensions has
not yet been formalized.

A misaligned load or store instruction may be decomposed into a set of
component memory operations of any granularity. An FLD or FSD
instruction for which XLEN<64 may also be decomposed into a set
of component memory operations of any granularity. The memory operations
generated by such instructions are not ordered with respect to each
other in program order, but they are ordered normally with respect to
the memory operations generated by preceding and subsequent instructions
in program order. The atomics extension “A” does not require execution
environments to support misaligned atomic instructions at all; however,
if misaligned atomics are supported via the “Zam” extension, LRs, SCs,
and AMOs may be decomposed subject to the constraints of the atomicity
axiom for misaligned atomics, which is defined in
Chapter [sec:zam].

The decomposition of misaligned memory operations down to byte
granularity facilitates emulation on implementations that do not
natively support misaligned accesses. Such implementations might, for
example, simply iterate over the bytes of a misaligned access one by
one.
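
As an illustration of byte-granularity decomposition, the sketch below lists the component store operations an emulation routine might generate for a misaligned store. The helper names are hypothetical and little-endian byte order is assumed. The list order carries no meaning for the memory model, since the component operations are not ordered with respect to each other in program order.

```python
def is_naturally_aligned(addr, size):
    """True if addr is a multiple of the access size in bytes."""
    return addr % size == 0

def decompose_store(addr, value, size):
    """Return byte-granularity (address, byte) component store operations.

    Assumes little-endian layout: the least-significant byte goes to the
    lowest address.
    """
    return [(addr + i, (value >> (8 * i)) & 0xFF) for i in range(size)]
```

A 4-byte store of `0xAABBCCDD` to address `0x1003` is misaligned and decomposes into four single-byte stores to `0x1003` through `0x1006`.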

An LR instruction and an SC instruction are said to be paired if the
LR precedes the SC in program order and if there are no other LR or SC
instructions in between; the corresponding memory operations are said to
be paired as well (except in case of a failed SC, where no store
operation is generated). The complete list of conditions determining
whether an SC must succeed, may succeed, or must fail is defined in
Section [sec:lrsc].
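
The pairing rule can be sketched as a scan over one hart's instructions in program order. The representation (a list of lowercase mnemonic strings) and the function name are assumptions made for illustration; whether a paired SC actually succeeds is a separate question that this sketch does not address.

```python
def paired_lr_sc(program):
    """Return (lr_index, sc_index) pairs for one hart's program order.

    An SC is paired with an LR if the LR precedes it and no other LR or SC
    lies in between, i.e. the most recent LR/SC before the SC is an LR.
    """
    pairs = []
    prev = None  # index of the most recent LR or SC seen so far
    for i, op in enumerate(program):
        if op == "sc" and prev is not None and program[prev] == "lr":
            pairs.append((prev, i))
        if op in ("lr", "sc"):
            prev = i
    return pairs
```

In `["lr", "lr", "sc"]` only the second LR is paired with the SC, and in `["lr", "sc", "sc"]` the second SC has no paired LR.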
Load and store operations may also carry one or more ordering
annotations from the following set: “acquire-RCpc”, “acquire-RCsc”,
“release-RCpc”, and “release-RCsc”. An AMO or LR instruction with aq
set has an “acquire-RCsc” annotation. An AMO or SC instruction with rl
set has a “release-RCsc” annotation. An AMO, LR, or SC instruction with
both aq and rl set has both “acquire-RCsc” and “release-RCsc”
annotations.
For convenience, we use the term “acquire annotation” to refer to an
acquire-RCpc annotation or an acquire-RCsc annotation. Likewise, a
“release annotation” refers to a release-RCpc annotation or a
release-RCsc annotation. An “RCpc annotation” refers to an acquire-RCpc
annotation or a release-RCpc annotation. An “RCsc annotation” refers to
an acquire-RCsc annotation or a release-RCsc annotation.
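
The mapping from the aq and rl bits to annotations, as stated above, can be written as a small lookup. The function name and the string encoding of instruction kinds are hypothetical, and combinations the text does not mention (such as an LR with only rl set) are given no annotation in this sketch.

```python
def rcsc_annotations(kind, aq, rl):
    """Return the set of ordering annotations for an "amo", "lr" or "sc"."""
    notes = set()
    if kind in ("amo", "lr", "sc") and aq and rl:
        notes |= {"acquire-RCsc", "release-RCsc"}
    if kind in ("amo", "lr") and aq:
        notes.add("acquire-RCsc")
    if kind in ("amo", "sc") and rl:
        notes.add("release-RCsc")
    return notes

def is_acquire(notes):
    # "acquire annotation" covers both the RCpc and RCsc variants.
    return bool(notes & {"acquire-RCpc", "acquire-RCsc"})

def is_release(notes):
    # "release annotation" covers both the RCpc and RCsc variants.
    return bool(notes & {"release-RCpc", "release-RCsc"})
```

For example, an LR with aq set carries only an acquire-RCsc annotation, while an SC with both bits set carries both annotations.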

In the memory model literature, the term “RCpc” stands for release
consistency with processor-consistent synchronization operations, and
the term “RCsc” stands for release consistency with sequentially
consistent synchronization operations.
While there are many different definitions for acquire and release
annotations in the literature, in the context of RVWMO these terms are
concisely and completely defined by Preserved Program Order rules
[ppo:acquire]–[ppo:rcsc].
“RCpc” annotations are currently only used when implicitly assigned to
every memory access per the standard extension “Ztso”
(Chapter [sec:ztso]). Furthermore, although the ISA
does not currently contain native load-acquire or store-release
instructions, nor RCpc variants thereof, the RVWMO model itself is
designed to be forwards-compatible with the potential addition of any or
all of the above into the ISA in a future extension.

Syntactic Dependencies

The definition of the RVWMO memory model depends in part on the notion
of a syntactic dependency, defined as follows.
In the context of defining dependencies, a “register” refers either to
an entire general-purpose register, some portion of a CSR, or an entire
CSR. The granularity at which dependencies are tracked through CSRs is
specific to each CSR and is defined in
Section [sec:csr-granularity].
Syntactic dependencies are defined in terms of instructions’ source
registers, instructions’ destination registers, and the way
instructions carry a dependency from their source registers to their
destination registers. This section provides a general definition of all
of these terms; however,
Section [sec:source-dest-regs]
provides a complete listing of the specifics for each instruction.
In general, a register r other than x0 is a source register for an
instruction i if any of the following hold:


In the opcode of i, rs1, rs2, or rs3 is set to r


i is a CSR instruction, and in the opcode of i, csr is set to
r, unless i is CSRRW or CSRRWI and rd is set to x0


r is a CSR and an implicit source register for i, as defined in
Section [sec:source-dest-regs]


r is a CSR that aliases with another source register for i


Memory instructions also further specify which source registers are
address source registers and which are data source registers.
In general, a register r other than x0 is a destination register
for an instruction i if any of the following hold:


In the opcode of i, rd is set to r


i is a CSR instruction, and in the opcode of i, csr is set to
r, unless i is CSRRS or CSRRC and rs1 is set to x0 or i is
CSRRSI or CSRRCI and uimm[4:0] is set to zero.


r is a CSR and an implicit destination register for i, as
defined in
Section [sec:source-dest-regs]


r is a CSR that aliases with another destination register for i
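The explicit-operand rules for CSR instructions given in the two lists above can be sketched in Python. This is a minimal sketch under stated assumptions: the function name `csr_regs` and its argument layout are hypothetical, and implicit and aliasing CSRs are deliberately left out.

```python
def csr_regs(op, rd, operand, csr):
    """Return (sources, destinations) for one CSR instruction.

    op      -- mnemonic, e.g. "CSRRW" or "CSRRSI"
    rd      -- destination register name, e.g. "x5"
    operand -- rs1 register name for register forms, uimm integer for
               immediate forms
    csr     -- name of the CSR in the csr field
    (Hypothetical representation, for illustration only.)
    """
    immediate = op.endswith("I")
    sources, dests = set(), set()

    # rs1 is a source unless it is x0; immediate forms have no rs1.
    if not immediate and operand != "x0":
        sources.add(operand)
    # rd is a destination unless it is x0.
    if rd != "x0":
        dests.add(rd)

    # csr is a source unless the instruction is CSRRW/CSRRWI with rd = x0.
    if not (op in ("CSRRW", "CSRRWI") and rd == "x0"):
        sources.add(csr)

    # csr is a destination unless a set/clear form has a zero operand.
    if op in ("CSRRS", "CSRRC"):
        writes_csr = operand != "x0"
    elif op in ("CSRRSI", "CSRRCI"):
        writes_csr = operand != 0
    else:
        writes_csr = True
    if writes_csr:
        dests.add(csr)

    return sources, dests
```

For example, `csr_regs("CSRRS", "x6", "x0", "frm")` reports frm as a source only, which matches the usual CSR read idiom.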


Most non-memory instructions carry a dependency from each of their
source registers to each of their destination registers. However, there
are exceptions to this rule; see
Section [sec:source-dest-regs].
Instruction j has a syntactic dependency on instruction i via
destination register s of i and source register r of j if either
of the following hold:


s is the same as r, and no instruction program-ordered between
i and j has r as a destination register


There is an instruction m program-ordered between i and j such
that all of the following hold:


j has a syntactic dependency on m via destination register
q and source register r


m has a syntactic dependency on i via destination register
s and source register p


m carries a dependency from p to q


Finally, in the definitions that follow, let a and b be two memory
operations, and let i and j be the instructions that generate a
and b, respectively.
b has a syntactic address dependency on a if r is an address
source register for j and j has a syntactic dependency on i via
source register r.
b has a syntactic data dependency on a if b is a store
operation, r is a data source register for j, and j has a
syntactic dependency on i via source register r.
b has a syntactic control dependency on a if there is an
instruction m program-ordered between i and j such that m is a
branch or indirect jump and m has a syntactic dependency on i.

Generally speaking, non-AMO load instructions do not have data source
registers, and unconditional non-AMO store instructions do not have
destination registers. However, a successful SC instruction is
considered to have the register specified in rd as a destination
register, and hence it is possible for an instruction to have a
syntactic dependency on a successful SC instruction that precedes it in
program order.
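The recursive definition of a syntactic dependency can be sketched as a direct transcription into Python. This is only an illustration under stated assumptions: the `Instr` class, its field names, and the three-instruction program are hypothetical, x0 is assumed to be excluded from the register sets by the caller, and the search is exponential in the worst case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instr:
    text: str
    sources: frozenset
    dests: frozenset
    # None means "every source to every destination" (the common case);
    # otherwise an explicit set of (source, destination) pairs.
    carries: Optional[frozenset] = None

    def carries_dep(self, p, q):
        if self.carries is None:
            return p in self.sources and q in self.dests
        return (p, q) in self.carries


def depends(prog, i, j, s, r):
    """True if prog[j] has a syntactic dependency on prog[i] via
    destination register s of prog[i] and source register r of prog[j]."""
    if not (i < j and s in prog[i].dests and r in prog[j].sources):
        return False
    between = range(i + 1, j)
    # First case: same register, not overwritten in between.
    if s == r and all(r not in prog[m].dests for m in between):
        return True
    # Second case: the dependency is carried through some instruction m.
    for m in between:
        for q in prog[m].dests:
            if not depends(prog, m, j, q, r):
                continue
            for p in prog[m].sources:
                if prog[m].carries_dep(p, q) and depends(prog, i, m, s, p):
                    return True
    return False


prog = [
    # Loads do not carry a dependency from the address to rd.
    Instr("lw a0, 0(s0)", frozenset({"s0"}), frozenset({"a0"}), frozenset()),
    Instr("addi a1, a0, 4", frozenset({"a0"}), frozenset({"a1"})),
    Instr("sw a2, 0(a1)", frozenset({"a1", "a2"}), frozenset()),
]
address_dep = depends(prog, 0, 2, "a0", "a1")
data_dep = depends(prog, 0, 2, "a0", "a2")
```

Here the store has a syntactic dependency on the load through its address register a1 (carried by the addi), but not through its data register a2.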

Preserved Program Order

The global memory order for any given execution of a program respects
some but not all of each hart’s program order. The subset of program
order that must be respected by the global memory order is known as
preserved program order.
The complete definition of preserved program order is as follows (and
note that AMOs are simultaneously both loads and stores): memory
operation a precedes memory operation b in preserved program order
(and hence also in the global memory order) if a precedes b in
program order, a and b both access regular main memory (rather than
I/O regions), and any of the following hold:

Overlapping-Address Orderings:

b is a store, and a and b access overlapping memory addresses


a and b are loads, x is a byte read by both a and b, there is no
store to x between a and b in program order, and a and b return
values for x written by different memory operations


a is generated by an AMO or SC instruction, b is a load, and b
returns a value written by a


Explicit Synchronization


There is a FENCE instruction that orders a before b


a has an acquire annotation


b has a release annotation


a and b both have RCsc annotations


a is paired with b


Syntactic Dependencies


b has a syntactic address dependency on a


b has a syntactic data dependency on a


b is a store, and b has a syntactic control dependency on a


Pipeline Dependencies


b is a load, and there exists some store m between a and b
in program order such that m has an address or data dependency
on a, and b returns a value written by m


b is a store, and there exists some instruction m between a and b
in program order such that m has an address dependency on a


Memory Model Axioms

An execution of a RISC-V program obeys the RVWMO memory consistency
model only if there exists a global memory order conforming to preserved
program order and satisfying the load value axiom, the atomicity
axiom, and the progress axiom.
Load Value Axiom

Each byte of each load i returns the value written to that byte by the
store that is the latest in global memory order among the following
stores:


Stores that write that byte and that precede i in the global
memory order


Stores that write that byte and that precede i in program order
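The load value axiom can be illustrated for a single byte with a short Python sketch. The `Op` record, the `load_value` function, and the `initial` parameter (the value seen when no candidate store exists) are assumptions made for this example and are not part of the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    hart: int
    po: int          # position in the hart's program order
    kind: str        # "load" or "store"
    value: int = 0   # byte written (stores only)


def load_value(gmo, load, initial=0):
    """Value of one byte returned by `load`.

    gmo lists every operation on that byte in global memory order.
    """
    pos = {op: n for n, op in enumerate(gmo)}
    candidates = [
        op for op in gmo
        if op.kind == "store" and (
            pos[op] < pos[load]                            # earlier in global memory order
            or (op.hart == load.hart and op.po < load.po)  # earlier in program order
        )
    ]
    if not candidates:
        return initial
    # The latest candidate in global memory order supplies the value.
    return max(candidates, key=pos.__getitem__).value


own_store = Op(hart=0, po=0, kind="store", value=7)
own_load = Op(hart=0, po=1, kind="load")
other_store = Op(hart=1, po=0, kind="store", value=9)

# The load precedes its own hart's store in global memory order,
# yet still returns that store's value.
forwarded = load_value([other_store, own_load, own_store], own_load)
```

The second condition is what lets a load observe a program-order-earlier store from its own hart even when that store appears later in the global memory order.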


Atomicity Axiom

If r and w are paired load and store operations generated by aligned
LR and SC instructions in a hart h, s is a store to byte x, and
r returns a value written by s, then s must precede w in the
global memory order, and there can be no store from a hart other than
h to byte x following s and preceding w in the global memory
order.

The atomicity axiom theoretically supports LR/SC pairs of different widths and to
mismatched addresses, since implementations are permitted to allow SC
operations to succeed in such cases. However, in practice, we expect
such patterns to be rare, and their use is discouraged.
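A check of the atomicity axiom for one byte can be sketched as follows. This is a minimal sketch, assuming stores are represented as hypothetical `(hart, label)` tuples listed in global memory order; it covers only the aligned, single-byte case.

```python
def atomicity_holds(gmo, s, w):
    """Check the atomicity axiom for byte x.

    gmo -- stores to byte x in global memory order, as (hart, label) tuples
    s   -- the store whose value the LR returned
    w   -- the store generated by the paired SC
    """
    si, wi = gmo.index(s), gmo.index(w)
    if si >= wi:
        return False  # s must precede w in the global memory order
    hart = w[0]
    # No store from another hart may fall between s and w.
    return all(store[0] == hart for store in gmo[si + 1:wi])


s = (1, "s")
w = (0, "sc")
ok = atomicity_holds([s, w], s, w)
violated = atomicity_holds([s, (2, "intruder"), w], s, w)
```

A store from the same hart as the SC may sit between s and w without violating the axiom; only stores from other harts are excluded.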

Progress Axiom

No memory operation may be preceded in the global memory order by an
infinite sequence of other memory operations.
CSR Dependency Tracking Granularity


| Name | Portions Tracked as Independent Units | Aliases |
|:- |:- |:- |
| fflags | Bits 4, 3, 2, 1, 0 | fcsr |
| frm | entire CSR | fcsr |
| fcsr | Bits 7-5, 4, 3, 2, 1, 0 | fflags, frm |


Granularities at which syntactic dependencies are tracked through CSRs
Note: read-only CSRs are not listed, as they do not participate in the
definition of syntactic dependencies.
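The granularity table can be encoded as data to make the aliasing relation concrete. This is a sketch under stated assumptions: the names `UNITS`, `shared_units`, and `aliases` are hypothetical, and placing frm at fcsr bits 7-5 and fflags at fcsr bits 4-0 follows the floating-point CSR layout defined elsewhere in the specification rather than anything stated in the table itself.

```python
# Each CSR maps to its independently tracked units, written as sets of
# fcsr bit positions (assumed layout: frm = fcsr[7:5], fflags = fcsr[4:0]).
UNITS = {
    "fflags": [{4}, {3}, {2}, {1}, {0}],
    "frm": [{7, 6, 5}],
    "fcsr": [{7, 6, 5}, {4}, {3}, {2}, {1}, {0}],
}


def shared_units(a, b):
    """Tracked units through which CSRs a and b overlap."""
    return [unit for unit in UNITS[a] if unit in UNITS[b]]


def aliases(a, b):
    """True if two distinct CSR names share at least one tracked unit."""
    return a != b and bool(shared_units(a, b))
```

With this encoding, fflags and frm each alias fcsr but not each other, which agrees with the Aliases column.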
Source and Destination Register Listings

This section provides a concrete listing of the source and destination
registers for each instruction. These listings are used in the
definition of syntactic dependencies in
Section [sec:memorymodel:dependencies].
The term “accumulating CSR” is used to describe a CSR that is both a
source and a destination register, but which carries a dependency only
from itself to itself.
Instructions carry a dependency from each source register in the “Source
Registers” column to each destination register in the “Destination
Registers” column, from each source register in the “Source Registers”
column to each CSR in the “Accumulating CSRs” column, and from each CSR
in the “Accumulating CSRs” column to itself, except where annotated
otherwise.
Key:

^A Address source register

^D Data source register

^† The instruction does not carry a dependency from any source
register to any destination register

^‡ The instruction carries dependencies from source register(s)
to destination register(s) as specified


RV32I Base Integer Instruction Set


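| Instruction | Source Registers | Destination Registers | Accumulating CSRs |
|:- |:- |:- |:- |
| LUI | | rd | |
| AUIPC | | rd | |
| JAL | | rd | |
| JALR^† | rs1 | rd | |
| BEQ | rs1, rs2 | | |
| BNE | rs1, rs2 | | |
| BLT | rs1, rs2 | | |
| BGE | rs1, rs2 | | |
| BLTU | rs1, rs2 | | |
| BGEU | rs1, rs2 | | |
| LB^† | rs1^A | rd | |
| LH^† | rs1^A | rd | |
| LW^† | rs1^A | rd | |
| LBU^† | rs1^A | rd | |
| LHU^† | rs1^A | rd | |
| SB | rs1^A, rs2^D | | |
| SH | rs1^A, rs2^D | | |
| SW | rs1^A, rs2^D | | |
| ADDI | rs1 | rd | |
| SLTI | rs1 | rd | |
| SLTIU | rs1 | rd | |
| XORI | rs1 | rd | |
| ORI | rs1 | rd | |
| ANDI | rs1 | rd | |
| SLLI | rs1 | rd | |
| SRLI | rs1 | rd | |
| SRAI | rs1 | rd | |
| ADD | rs1, rs2 | rd | |
| SUB | rs1, rs2 | rd | |
| SLL | rs1, rs2 | rd | |
| SLT | rs1, rs2 | rd | |
| SLTU | rs1, rs2 | rd | |
| XOR | rs1, rs2 | rd | |
| SRL | rs1, rs2 | rd | |
| SRA | rs1, rs2 | rd | |
| OR | rs1, rs2 | rd | |
| AND | rs1, rs2 | rd | |
| FENCE | | | |
| FENCE.I | | | |
| ECALL | | | |
| EBREAK | | | |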


RV32I Base Integer Instruction Set (continued)


| Instruction | Source Registers | Destination Registers | Accumulating CSRs | Notes |
|:- |:- |:- |:- |:- |
| CSRRW^‡ | rs1, csr^* | rd, csr | | ^*unless rd=x0 |
| CSRRS^‡ | rs1, csr | rd, csr^* | | ^*unless rs1=x0 |
| CSRRC^‡ | rs1, csr | rd, csr^* | | ^*unless rs1=x0 |


‡carries a dependency from rs1 to csr and from csr to rd


RV32I Base Integer Instruction Set (continued)


| Instruction | Source Registers | Destination Registers | Accumulating CSRs | Notes |
|:- |:- |:- |:- |:- |
| CSRRWI^‡ | csr^* | rd, csr | | ^*unless rd=x0 |
| CSRRSI^‡ | csr | rd, csr^* | | ^*unless uimm[4:0]=0 |
| CSRRCI^‡ | csr | rd, csr^* | | ^*unless uimm[4:0]=0 |


‡carries a dependency from csr to rd


RV64I Base Integer Instruction Set


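| Instruction | Source Registers | Destination Registers | Accumulating CSRs |
|:- |:- |:- |:- |
| LWU^† | rs1^A | rd | |
| LD^† | rs1^A | rd | |
| SD | rs1^A, rs2^D | | |
| SLLI | rs1 | rd | |
| SRLI | rs1 | rd | |
| SRAI | rs1 | rd | |
| ADDIW | rs1 | rd | |
| SLLIW | rs1 | rd | |
| SRLIW | rs1 | rd | |
| SRAIW | rs1 | rd | |
| ADDW | rs1, rs2 | rd | |
| SUBW | rs1, rs2 | rd | |
| SLLW | rs1, rs2 | rd | |
| SRLW | rs1, rs2 | rd | |
| SRAW | rs1, rs2 | rd | |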


RV32M Standard Extension


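| Instruction | Source Registers | Destination Registers | Accumulating CSRs |
|:- |:- |:- |:- |
| MUL | rs1, rs2 | rd | |
| MULH | rs1, rs2 | rd | |
| MULHSU | rs1, rs2 | rd | |
| MULHU | rs1, rs2 | rd | |
| DIV | rs1, rs2 | rd | |
| DIVU | rs1, rs2 | rd | |
| REM | rs1, rs2 | rd | |
| REMU | rs1, rs2 | rd | |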


RV64M Standard Extension


| Instruction | Source Registers | Destination Registers | Accumulating CSRs |
|:- |:- |:- |:- |
| MULW | rs1, rs2 | rd | |
| DIVW | rs1, rs2 | rd | |
| DIVUW | rs1, rs2 | rd | |
| REMW | rs1, rs2 | rd | |
| REMUW | rs1, rs2 | rd | |


RV32A Standard Extension


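| Instruction | Source Registers | Destination Registers | Accumulating CSRs | Notes |
|:- |:- |:- |:- |:- |
| LR.W^† | rs1^A | rd | | |
| SC.W^† | rs1^A, rs2^D | rd^* | | ^*if successful |
| AMOSWAP.W^† | rs1^A, rs2^D | rd | | |
| AMOADD.W^† | rs1^A, rs2^D | rd | | |
| AMOXOR.W^† | rs1^A, rs2^D | rd | | |
| AMOAND.W^† | rs1^A, rs2^D | rd | | |
| AMOOR.W^† | rs1^A, rs2^D | rd | | |
| AMOMIN.W^† | rs1^A, rs2^D | rd | | |
| AMOMAX.W^† | rs1^A, rs2^D | rd | | |
| AMOMINU.W^† | rs1^A, rs2^D | rd | | |
| AMOMAXU.W^† | rs1^A, rs2^D | rd | | |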


RV64A Standard Extension


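| Instruction | Source Registers | Destination Registers | Accumulating CSRs | Notes |
|:- |:- |:- |:- |:- |
| LR.D^† | rs1^A | rd | | |
| SC.D^† | rs1^A, rs2^D | rd^* | | ^*if successful |
| AMOSWAP.D^† | rs1^A, rs2^D | rd | | |
| AMOADD.D^† | rs1^A, rs2^D | rd | | |
| AMOXOR.D^† | rs1^A, rs2^D | rd | | |
| AMOAND.D^† | rs1^A, rs2^D | rd | | |
| AMOOR.D^† | rs1^A, rs2^D | rd | | |
| AMOMIN.D^† | rs1^A, rs2^D | rd | | |
| AMOMAX.D^† | rs1^A, rs2^D | rd | | |
| AMOMINU.D^† | rs1^A, rs2^D | rd | | |
| AMOMAXU.D^† | rs1^A, rs2^D | rd | | |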


RV32F Standard Extension


| Instruction | Source Registers | Destination Registers | Accumulating CSRs | Notes |
|:- |:- |:- |:- |:- |
| FLW^† | rs1^A | rd | | |
| FSW | rs1^A, rs2^D | | | |
| FMADD.S | rs1, rs2, rs3, frm^* | rd | NV, OF, UF, NX | ^*if rm=111 |
| FMSUB.S | rs1, rs2, rs3, frm^* | rd | NV, OF, UF, NX | ^*if rm=111 |
| FNMSUB.S | rs1, rs2, rs3, frm^* | rd | NV, OF, UF, NX | ^*if rm=111 |
| FNMADD.S | rs1, rs2, rs3, frm^* | rd | NV, OF, UF, NX | ^*if rm=111 |
| FADD.S | rs1, rs2, frm^* | rd | NV, OF, NX | ^*if rm=111 |
| FSUB.S | rs1, rs2, frm^* | rd | NV, OF, NX | ^*if rm=111 |
| FMUL.S | rs1, rs2, frm^* | rd | NV, OF, UF, NX | ^*if rm=111 |
| FDIV.S | rs1, rs2, frm^* | rd | NV, DZ, OF, UF, NX | ^*if rm=111 |
| FSQRT.S | rs1, frm^* | rd | NV, NX | ^*if rm=111 |
| FSGNJ.S | rs1, rs2 | rd | | |
| FSGNJN.S | rs1, rs2 | rd | | |
| FSGNJX.S | rs1, rs2 | rd | | |
| FMIN.S | rs1, rs2 | rd | NV | |
| FMAX.S | rs1, rs2 | rd | NV | |
| FCVT.W.S | rs1, frm^* | rd | NV, NX | ^*if rm=111 |
| FCVT.WU.S | rs1, frm^* | rd | NV, NX | ^*if rm=111 |
| FMV.X.W | rs1 | rd | | |
| FEQ.S | rs1, rs2 | rd | NV | |
| FLT.S | rs1, rs2 | rd | NV | |
| FLE.S | rs1, rs2 | rd | NV | |
| FCLASS.S | rs1 | rd | | |
| FCVT.S.W | rs1, frm^* | rd | NX | ^*if rm=111 |
| FCVT.S.WU | rs1, frm^* | rd | NX | ^*if rm=111 |
| FMV.W.X | rs1 | rd | | |


RV64F Standard Extension


| Instruction | Source Registers | Destination Registers | Accumulating CSRs | Notes |
|:- |:- |:- |:- |:- |
| FCVT.L.S | rs1, frm^* | rd | NV, NX | ^*if rm=111 |
| FCVT.LU.S | rs1, frm^* | rd | NV, NX | ^*if rm=111 |
| FCVT.S.L | rs1, frm^* | rd | NX | ^*if rm=111 |
| FCVT.S.LU | rs1, frm^* | rd | NX | ^*if rm=111 |


RV32D Standard Extension


| Instruction | Source Registers | Destination Registers | Accumulating CSRs | Notes |
|:- |:- |:- |:- |:- |
| FLD^† | rs1^A | rd | | |
| FSD | rs1^A, rs2^D | | | |
| FMADD.D | rs1, rs2, rs3, frm^* | rd | NV, OF, UF, NX | ^*if rm=111 |
| FMSUB.D | rs1, rs2, rs3, frm^* | rd | NV, OF, UF, NX | ^*if rm=111 |
| FNMSUB.D | rs1, rs2, rs3, frm^* | rd | NV, OF, UF, NX | ^*if rm=111 |
| FNMADD.D | rs1, rs2, rs3, frm^* | rd | NV, OF, UF, NX | ^*if rm=111 |
| FADD.D | rs1, rs2, frm^* | rd | NV, OF, NX | ^*if rm=111 |
| FSUB.D | rs1, rs2, frm^* | rd | NV, OF, NX | ^*if rm=111 |
| FMUL.D | rs1, rs2, frm^* | rd | NV, OF, UF, NX | ^*if rm=111 |
| FDIV.D | rs1, rs2, frm^* | rd | NV, DZ, OF, UF, NX | ^*if rm=111 |
| FSQRT.D | rs1, frm^* | rd | NV, NX | ^*if rm=111 |
| FSGNJ.D | rs1, rs2 | rd | | |
| FSGNJN.D | rs1, rs2 | rd | | |
| FSGNJX.D | rs1, rs2 | rd | | |
| FMIN.D | rs1, rs2 | rd | NV | |
| FMAX.D | rs1, rs2 | rd | NV | |
| FCVT.S.D | rs1, frm^* | rd | NV, OF, UF, NX | ^*if rm=111 |
| FCVT.D.S | rs1 | rd | NV | |
| FEQ.D | rs1, rs2 | rd | NV | |
| FLT.D | rs1, rs2 | rd | NV | |
| FLE.D | rs1, rs2 | rd | NV | |
| FCLASS.D | rs1 | rd | | |
| FCVT.W.D | rs1, frm^* | rd | NV, NX | ^*if rm=111 |
| FCVT.WU.D | rs1, frm^* | rd | NV, NX | ^*if rm=111 |
| FCVT.D.W | rs1 | rd | | |
| FCVT.D.WU | rs1 | rd | | |


RV64D Standard Extension


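| Instruction | Source Registers | Destination Registers | Accumulating CSRs | Notes |
|:- |:- |:- |:- |:- |
| FCVT.L.D | rs1, frm^* | rd | NV, NX | ^*if rm=111 |
| FCVT.LU.D | rs1, frm^* | rd | NV, NX | ^*if rm=111 |
| FMV.X.D | rs1 | rd | | |
| FCVT.D.L | rs1, frm^* | rd | NX | ^*if rm=111 |
| FCVT.D.LU | rs1, frm^* | rd | NX | ^*if rm=111 |
| FMV.D.X | rs1 | rd | | |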


“C” Standard Extension for Compressed Instructions, Version 2.0

This chapter describes the RISC-V standard compressed instruction-set
extension, named “C”, which reduces static and dynamic code size by
adding short 16-bit instruction encodings for common operations. The C
extension can be added to any of the base ISAs (RV32, RV64, RV128), and
we use the generic term “RVC” to cover any of these. Typically, 50%–60%
of the RISC-V instructions in a program can be replaced with RVC
instructions, resulting in a 25%–30% code-size reduction.
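The two percentage ranges are linked by simple arithmetic: each replaced instruction shrinks from 32 bits to 16 bits, so the saving is half the replaced fraction. The following sketch works this out; the function name `code_size_reduction` is hypothetical, and it assumes every uncompressed instruction is 32 bits wide.

```python
def code_size_reduction(fraction_compressed):
    """Fractional reduction in code size when the given fraction of
    32-bit instructions is replaced by 16-bit encodings."""
    original_bits = 32.0
    average_bits = (16.0 * fraction_compressed
                    + 32.0 * (1.0 - fraction_compressed))
    return 1.0 - average_bits / original_bits


low = code_size_reduction(0.50)   # half of the instructions compressed
high = code_size_reduction(0.60)  # 60% of the instructions compressed
```

Replacing 50% of the instructions gives a reduction of about 0.25, and replacing 60% gives about 0.30, which matches the quoted 25%–30% range.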
Overview

RVC uses a simple compression scheme that offers shorter 16-bit versions
of common 32-bit RISC-V instructions when:

the immediate or address offset is small, or
one of the registers is the zero register (x0), the ABI link register
(x1), or the ABI stack pointer (x2), or
the destination register and the first source register are identical, or
the registers used are the 8 most popular ones.

The C extension is compatible with all other standard instruction
extensions. The C extension allows 16-bit instructions to be freely
intermixed with 32-bit instructions, with the latter now able to start
on any 16-bit boundary, i.e., IALIGN=16. With the addition of the C
extension, no instructions can raise instruction-address-misaligned
exceptions.

Removing the 32-bit alignment constraint on the original 32-bit
instructions allows significantly greater code density.

The compressed instruction encodings are mostly common across RV32C,
RV64C, and RV128C, but as shown in
Table [rvcopcodemap], a few opcodes are used
for different purposes depending on base ISA. For example, the wider
address-space RV64C and RV128C variants require additional opcodes to
compress loads and stores of 64-bit integer values, while RV32C uses the
same opcodes to compress loads and stores of single-precision
floating-point values. Similarly, RV128C requires additional opcodes to
capture loads and stores of 128-bit integer values, while these same
opcodes are used for loads and stores of double-precision floating-point
values in RV32C and RV64C. If the C extension is implemented, the
appropriate compressed floating-point load and store instructions must
be provided whenever the relevant standard floating-point extension (F
and/or D) is also implemented. In addition, RV32C includes a compressed
jump and link instruction to compress short-range subroutine calls,
where the same opcode is used to compress ADDIW for RV64C and RV128C.

Double-precision loads and stores are a significant fraction of static
and dynamic instructions, hence the motivation to include them in the
RV32C and RV64C encoding.
Although single-precision loads and stores are not a significant source
of static or dynamic compression for benchmarks compiled for the
currently supported ABIs, for microcontrollers that only provide
hardware single-precision floating-point units and have an ABI that only
supports single-precision floating-point numbers, the single-precision
loads and stores will be used at least as frequently as double-precision
loads and stores in the measured benchmarks. Hence, the motivation to
provide compressed support for these in RV32C.
Short-range subroutine calls are more likely in small binaries for
microcontrollers, hence the motivation to include these in RV32C.
Although reusing opcodes for different purposes for different base ISAs
adds some complexity to documentation, the impact on implementation
complexity is small even for designs that support multiple base ISAs.
The compressed floating-point load and store variants use the same
instruction format with the same register specifiers as the wider
integer loads and stores.

RVC was designed under the constraint that each RVC instruction expands
into a single 32-bit instruction in either the base ISA (RV32I/E, RV64I,
or RV128I) or the F and D standard extensions where present. Adopting
this constraint has two main benefits:

Hardware designs can simply expand RVC instructions during decode,
simplifying verification and minimizing modifications to existing
microarchitectures.
Compilers can be unaware of the RVC extension and leave code compression
to the assembler and linker, although a compression-aware compiler will
generally be able to produce better results.


We felt the multiple complexity reductions of a simple one-one mapping
between C and base IFD instructions far outweighed the potential gains
of a slightly denser encoding that added additional instructions only
supported in the C extension, or that allowed encoding of multiple IFD
instructions in one C instruction.

It is important to note that the C extension is not designed to be a
stand-alone ISA, and is meant to be used alongside a base ISA.

Variable-length instruction sets have long been used to improve code
density. For example, the IBM Stretch, developed in the late 1950s, had
an ISA with 32-bit and 64-bit instructions, where some of the 32-bit
instructions were compressed versions of the full 64-bit instructions.
Stretch also employed the concept of limiting the set of registers that
were addressable in some of the shorter instruction formats, with short
branch instructions that could only refer to one of the index registers.
The later IBM 360 architecture supported a simple variable-length
instruction encoding with 16-bit, 32-bit, or 48-bit instruction formats.
In 1963, CDC introduced the Cray-designed CDC 6600, a precursor to RISC
architectures, that introduced a register-rich load-store architecture
with instructions of two lengths, 15 bits and 30 bits. The later Cray-1
design used a very similar instruction format, with 16-bit and 32-bit
instruction lengths.
The initial RISC ISAs from the 1980s all picked performance over code
size, which was reasonable for a workstation environment, but not for
embedded systems. Hence, both ARM and MIPS subsequently made versions of
the ISAs that offered smaller code size by offering an alternative
16-bit wide instruction set instead of the standard 32-bit wide
instructions. The compressed RISC ISAs reduced code size relative to
their starting points by about 25–30%, yielding code that was
significantly smaller than 80x86. This result surprised some, as their
intuition was that the variable-length CISC ISA should be smaller than
RISC ISAs that offered only 16-bit and 32-bit formats.
Since the original RISC ISAs did not leave sufficient opcode space free
to include these unplanned compressed instructions, they were instead
developed as complete new ISAs. This meant compilers needed different
code generators for the separate compressed ISAs. The first compressed
RISC ISA extensions (e.g., ARM Thumb and MIPS16) used only a fixed
16-bit instruction size, which gave good reductions in static code size
but caused an increase in dynamic instruction count, which led to lower
performance compared to the original fixed-width 32-bit instruction
size. This led to the development of a second generation of compressed
RISC ISA designs with mixed 16-bit and 32-bit instruction lengths (e.g.,
ARM Thumb2, microMIPS, PowerPC VLE), so that performance was similar to
pure 32-bit instructions but with significant code size savings.
Unfortunately, these different generations of compressed ISAs are
incompatible with each other and with the original uncompressed ISA,
leading to significant complexity in documentation, implementations, and
software tools support.
Of the commonly used 64-bit ISAs, only PowerPC and microMIPS currently
support a compressed instruction format. It is surprising that the most
popular 64-bit ISA for mobile platforms (ARM v8) does not include a
compressed instruction format given that static code size and dynamic
instruction fetch bandwidth are important metrics. Although static code
size is not a major concern in larger systems, instruction fetch
bandwidth can be a major bottleneck in servers running commercial
workloads, which often have a large instruction working set.
Benefiting from 25 years of hindsight, RISC-V was designed to support
compressed instructions from the outset, leaving enough opcode space for
RVC to be added as a simple extension on top of the base ISA (along with
many other extensions). The philosophy of RVC is to reduce code size for
embedded applications and to improve performance and energy-efficiency
for all applications due to fewer misses in the instruction cache.
Waterman shows that RVC fetches 25%–30% fewer instruction bits, which
reduces instruction cache misses by 20%–25%, or roughly the same
performance impact as doubling the instruction cache size.

Compressed Instruction Formats

Table 1.1 shows the nine compressed
instruction formats. CR, CI, and CSS can use any of the 32 RVI
registers, but CIW, CL, CS, CA, and CB are limited to just 8 of them.
Table 1.2 lists these popular registers, which
correspond to registers x8 to x15. Note that there is a separate
version of load and store instructions that use the stack pointer as the
base address register, since saving to and restoring from the stack are
so prevalent, and that they use the CI and CSS formats to allow access
to all 32 data registers. CIW supplies an 8-bit immediate for the
ADDI4SPN instruction.

The RISC-V ABI was changed to make the frequently used registers map to
registers x8–x15. This simplifies the decompression decoder by
having a contiguous naturally aligned set of register numbers, and is
also compatible with the RV32E base ISA, which only has 16 integer
registers.

Compressed register-based floating-point loads and stores also use the
CL and CS formats respectively, with the eight registers mapping to f8
to f15.

The standard RISC-V calling convention maps the most frequently used
floating-point registers to registers f8 to f15, which allows the
same register decompression decoding as for integer register numbers.

The formats were designed to keep bits for the two register source
specifiers in the same place in all instructions, while the destination
register field can move. When the full 5-bit destination register
specifier is present, it is in the same place as in the 32-bit RISC-V
encoding. Where immediates are sign-extended, the sign-extension is
always from bit 12. Immediate fields have been scrambled, as in the base
specification, to reduce the number of immediate muxes required.

The immediate fields are scrambled in the instruction formats instead of
in sequential order so that as many bits as possible are in the same
position in every instruction, thereby simplifying implementations.

For many RVC instructions, zero-valued immediates are disallowed and
x0 is not a valid 5-bit register specifier. These restrictions free up
encoding space for other instructions requiring fewer operand bits.


| Format | Meaning | Fields, from bit 15 down to bit 0 |
|:- |:- |:- |
| CR | Register | funct4, rd/rs1, rs2, op |
| CI | Immediate | funct3, imm, rd/rs1, imm, op |
| CSS | Stack-relative Store | funct3, imm, rs2, op |
| CIW | Wide Immediate | funct3, imm, rd ′, op |
| CL | Load | funct3, imm, rs1 ′, imm, rd ′, op |
| CS | Store | funct3, imm, rs1 ′, imm, rs2 ′, op |
| CA | Arithmetic | funct6, rd ′/rs1 ′, funct2, rs2 ′, op |
| CB | Branch/Arithmetic | funct3, offset, rd ′/rs1 ′, offset, op |
| CJ | Jump | funct3, jump target, op |


Compressed 16-bit RVC instruction formats.


| RVC Register Number | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
|:- |:- |:- |:- |:- |:- |:- |:- |:- |
| Integer Register Number | x8 | x9 | x10 | x11 | x12 | x13 | x14 | x15 |
| Integer Register ABI Name | s0 | s1 | a0 | a1 | a2 | a3 | a4 | a5 |
| Floating-Point Register Number | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 |
| Floating-Point Register ABI Name | fs0 | fs1 | fa0 | fa1 | fa2 | fa3 | fa4 | fa5 |


Registers specified by the three-bit rs1 ′, rs2 ′, and rd ′ fields
of the CIW, CL, CS, CA, and CB formats.
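The mapping from a three-bit register specifier to a full register number amounts to adding 8. A minimal sketch follows; the function name `expand_reg` and its `floating` flag are hypothetical.

```python
def expand_reg(spec3, floating=False):
    """Map a 3-bit rs1'/rs2'/rd' specifier to a register name."""
    if not 0 <= spec3 <= 0b111:
        raise ValueError("specifier must fit in three bits")
    prefix = "f" if floating else "x"
    # The eight encodable registers are numbers 8 through 15.
    return f"{prefix}{8 + spec3}"
```

Because the range x8 to x15 is contiguous and naturally aligned, a decoder can form the 5-bit register number by prepending the bits 01 to the 3-bit specifier.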


Load and Store Instructions

To increase the reach of 16-bit instructions, data-transfer instructions
use zero-extended immediates that are scaled by the size of the data in
bytes: ×4 for words, ×8 for double words, and
×16 for quad words.
RVC provides two variants of loads and stores. One uses the ABI stack
pointer, x2, as the base address and can target any data register. The
other can reference one of 8 base address registers and one of 8 data
registers.
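The effect of scaling on reach can be worked out directly. The sketch below assumes the offset field widths of the formats described in this chapter (6 bits for the stack-pointer-based forms, 5 bits for the register-based forms); the function name `max_offset` is hypothetical.

```python
def max_offset(field_bits, access_bytes):
    """Largest byte offset reachable with a zero-extended immediate of
    field_bits bits that is scaled by the access size."""
    return ((1 << field_bits) - 1) * access_bytes


sp_word = max_offset(6, 4)     # stack-pointer-based word access
sp_double = max_offset(6, 8)   # stack-pointer-based double-word access
reg_word = max_offset(5, 4)    # register-based word access
reg_double = max_offset(5, 8)  # register-based double-word access
```

Under these assumptions a stack-pointer-based word access reaches byte offsets 0 to 252, and a register-based word access reaches 0 to 124.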
Stack-Pointer-Based Loads and Stores


| funct3 (3 bits) | imm (1 bit) | rd (5 bits) | imm (5 bits) | op (2 bits) |
|:- |:- |:- |:- |:- |
| C.LWSP | offset[5] | dest≠0 | offset[4:2\|7:6] | C2 |
| C.LDSP | offset[5] | dest≠0 | offset[4:3\|8:6] | C2 |
| C.LQSP | offset[5] | dest≠0 | offset[4\|9:6] | C2 |
| C.FLWSP | offset[5] | dest | offset[4:2\|7:6] | C2 |
| C.FLDSP | offset[5] | dest | offset[4:3\|8:6] | C2 |


These instructions use the CI format.
C.LWSP loads a 32-bit value from memory into register rd. It computes
an effective address by adding the zero-extended offset, scaled by 4,
to the stack pointer, x2. It expands to lw rd, offset(x2). C.LWSP is
only valid when rd ≠ x0; the code points with rd = x0 are
reserved.
C.LDSP is an RV64C/RV128C-only instruction that loads a 64-bit value
from memory into register rd. It computes its effective address by
adding the zero-extended offset, scaled by 8, to the stack pointer,
x2. It expands to ld rd, offset(x2). C.LDSP is only valid when
rd ≠ x0; the code points with rd = x0 are reserved.
C.LQSP is an RV128C-only instruction that loads a 128-bit value from
memory into register rd. It computes its effective address by adding
the zero-extended offset, scaled by 16, to the stack pointer, x2. It
expands to lq rd, offset(x2). C.LQSP is only valid when rd ≠ x0;
the code points with rd = x0 are reserved.
C.FLWSP is an RV32FC-only instruction that loads a single-precision
floating-point value from memory into floating-point register rd. It
computes its effective address by adding the zero-extended offset,
scaled by 4, to the stack pointer, x2. It expands to
flw rd, offset(x2).
C.FLDSP is an RV32DC/RV64DC-only instruction that loads a
double-precision floating-point value from memory into floating-point
register rd. It computes its effective address by adding the
zero-extended offset, scaled by 8, to the stack pointer, x2. It
expands to fld rd, offset(x2).
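As an illustration of the expansion, the operand fields of C.LWSP can be unpacked as follows. This is a sketch under stated assumptions: the function name is hypothetical, the bit positions follow from laying out the CI fields from bit 15 downward, and the funct3 and op fields are not checked.

```python
def decode_c_lwsp(insn):
    """Expand the operand fields of a 16-bit C.LWSP encoding.

    Assumed field positions: imm in bit 12, rd in bits 11:7, imm in
    bits 6:2. The funct3 and op fields are ignored here.
    """
    rd = (insn >> 7) & 0x1F
    if rd == 0:
        raise ValueError("reserved: C.LWSP with rd = x0")
    offset = (
        ((insn >> 12) & 0x1) << 5    # offset[5]
        | ((insn >> 4) & 0x7) << 2   # offset[4:2]
        | ((insn >> 2) & 0x3) << 6   # offset[7:6]
    )
    return f"lw x{rd}, {offset}(x2)"


expanded = decode_c_lwsp(0x4532)
```

The scaling by 4 is implicit in the field layout: offset bits 1:0 are never encoded, so the reassembled offset is always a multiple of 4.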


| funct3 (3 bits) | imm (6 bits) | rs2 (5 bits) | op (2 bits) |
|:- |:- |:- |:- |
| C.SWSP | offset[5:2\|7:6] | src | C2 |
| C.SDSP | offset[5:3\|8:6] | src | C2 |
| C.SQSP | offset[5:4\|9:6] | src | C2 |
| C.FSWSP | offset[5:2\|7:6] | src | C2 |
| C.FSDSP | offset[5:3\|8:6] | src | C2 |


These instructions use the CSS format.
C.SWSP stores a 32-bit value in register rs2 to memory. It computes an
effective address by adding the zero-extended offset, scaled by 4, to
the stack pointer, x2. It expands to sw rs2, offset(x2).
C.SDSP is an RV64C/RV128C-only instruction that stores a 64-bit value in
register rs2 to memory. It computes an effective address by adding the
zero-extended offset, scaled by 8, to the stack pointer, x2. It
expands to sd rs2, offset(x2).
C.SQSP is an RV128C-only instruction that stores a 128-bit value in
register rs2 to memory. It computes an effective address by adding the
zero-extended offset, scaled by 16, to the stack pointer, x2. It
expands to sq rs2, offset(x2).
C.FSWSP is an RV32FC-only instruction that stores a single-precision
floating-point value in floating-point register rs2 to memory. It
computes an effective address by adding the zero-extended offset,
scaled by 4, to the stack pointer, x2. It expands to
fsw rs2, offset(x2).
C.FSDSP is an RV32DC/RV64DC-only instruction that stores a
double-precision floating-point value in floating-point register rs2
to memory. It computes an effective address by adding the
zero-extended offset, scaled by 8, to the stack pointer, x2. It
expands to fsd rs2, offset(x2).
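The store forms unpack in the same way, with the whole offset held in one 6-bit field. The sketch below handles C.SWSP only; the function name is hypothetical, the bit positions follow from laying out the CSS fields from bit 15 downward, and the funct3 and op fields are not checked.

```python
def decode_c_swsp(insn):
    """Expand the operand fields of a 16-bit C.SWSP encoding.

    Assumed field positions: imm in bits 12:7, rs2 in bits 6:2.
    The funct3 and op fields are ignored here.
    """
    rs2 = (insn >> 2) & 0x1F
    offset = (
        ((insn >> 9) & 0xF) << 2     # offset[5:2]
        | ((insn >> 7) & 0x3) << 6   # offset[7:6]
    )
    return f"sw x{rs2}, {offset}(x2)"


expanded_store = decode_c_swsp(0xC606)
```

Unlike the load, no register value is reserved here, so x0 is accepted as the source register.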

Register save/restore code at function entry/exit represents a
significant portion of static code size. The stack-pointer-based
compressed loads and stores in RVC are effective at reducing the
save/restore static code size by a factor of 2 while improving
performance by reducing dynamic instruction bandwidth.
A common mechanism used in other ISAs to further reduce save/restore
code size is load-multiple and store-multiple instructions. We
considered adopting these for RISC-V but noted the following drawbacks
to these instructions:


These instructions complicate processor implementations.


For virtual memory systems, some data accesses could be resident in
physical memory and some could not, which requires a new restart
mechanism for partially executed instructions.


Unlike the rest of the RVC instructions, there is no IFD equivalent
to Load Multiple and Store Multiple.


Unlike the rest of the RVC instructions, the compiler would have to
be aware of these instructions to both generate the instructions and
to allocate registers in an order to maximize the chances of
them being saved and restored, since they would be saved and restored
in sequential order.


Simple microarchitectural implementations will constrain how other
instructions can be scheduled around the load and store multiple
instructions, leading to a potential performance loss.


The desire for sequential register allocation might conflict with
the featured registers selected for the CIW, CL, CS, CA, and CB
formats.


Furthermore, much of the gains can be realized in software by replacing
prologue and epilogue code with subroutine calls to common prologue and
epilogue code, a technique described in Section 5.6 of .
While reasonable architects might come to different conclusions, we
decided to omit load and store multiple and instead use the
software-only approach of calling save/restore millicode routines to
attain the greatest code size reduction.

Register-Based Loads and Stores


| 15:13 | 12:10 | 9:7 | 6:5 | 4:2 | 1:0 |
|:- |:- |:- |:- |:- |:- |
| 3 | 3 | 3 | 2 | 3 | 2 |
| C.LW | offset[5:3] | base | offset[2\|6] | dest | C0 |
| C.LD | offset[5:3] | base | offset[7:6] | dest | C0 |
| C.LQ | offset[5\|4\|8] | base | offset[7:6] | dest | C0 |
| C.FLW | offset[5:3] | base | offset[2\|6] | dest | C0 |
| C.FLD | offset[5:3] | base | offset[7:6] | dest | C0 |


These instructions use the CL format.

C.LW loads a 32-bit value from memory into register rd′. It computes
an effective address by adding the zero-extended offset, scaled by 4,
to the base address in register rs1′. It expands to
lw rd′, offset(rs1′).

C.LD is an RV64C/RV128C-only instruction that loads a 64-bit value from
memory into register rd′. It computes an effective address by adding
the zero-extended offset, scaled by 8, to the base address in register
rs1′. It expands to ld rd′, offset(rs1′).

C.LQ is an RV128C-only instruction that loads a 128-bit value from
memory into register rd′. It computes an effective address by adding
the zero-extended offset, scaled by 16, to the base address in
register rs1′. It expands to lq rd′, offset(rs1′).

C.FLW is an RV32FC-only instruction that loads a single-precision
floating-point value from memory into floating-point register rd′. It
computes an effective address by adding the zero-extended offset,
scaled by 4, to the base address in register rs1′. It expands to
flw rd′, offset(rs1′).

C.FLD is an RV32DC/RV64DC-only instruction that loads a double-precision
floating-point value from memory into floating-point register rd′. It
computes an effective address by adding the zero-extended offset,
scaled by 8, to the base address in register rs1′. It expands to
fld rd′, offset(rs1′).
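As an illustration of how the scattered offset bits of a CL-format load are reassembled, the following is a minimal Python sketch that decodes a C.LW encoding. The function name is hypothetical, and the sketch assumes the usual RVC convention that the 3-bit rd′ and rs1′ fields select registers x8 through x15.

```python
def decode_c_lw(inst: int):
    """Decode a 16-bit C.LW encoding into (rd, rs1, byte_offset).

    Hypothetical helper for illustration; not part of the specification.
    """
    if inst & 0x3 != 0b00 or (inst >> 13) & 0x7 != 0b010:
        raise ValueError("not a C.LW encoding")
    off_5_3 = (inst >> 10) & 0x7   # inst[12:10] holds offset[5:3]
    off_2 = (inst >> 6) & 0x1      # inst[6] holds offset[2]
    off_6 = (inst >> 5) & 0x1      # inst[5] holds offset[6]
    offset = (off_6 << 6) | (off_5_3 << 3) | (off_2 << 2)
    rs1 = 8 + ((inst >> 7) & 0x7)  # rs1' selects x8..x15 (assumed mapping)
    rd = 8 + ((inst >> 2) & 0x7)   # rd' selects x8..x15 (assumed mapping)
    return rd, rs1, offset
```

Because offset[1:0] is never encoded, the decoded offset is always a multiple of 4, which is what "scaled by 4" means in the description above.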


| 15:13 | 12:10 | 9:7 | 6:5 | 4:2 | 1:0 |
|:- |:- |:- |:- |:- |:- |
| 3 | 3 | 3 | 2 | 3 | 2 |
| C.SW | offset[5:3] | base | offset[2\|6] | src | C0 |
| C.SD | offset[5:3] | base | offset[7:6] | src | C0 |
| C.SQ | offset[5\|4\|8] | base | offset[7:6] | src | C0 |
| C.FSW | offset[5:3] | base | offset[2\|6] | src | C0 |
| C.FSD | offset[5:3] | base | offset[7:6] | src | C0 |


These instructions use the CS format.

C.SW stores a 32-bit value in register rs2′ to memory. It computes an
effective address by adding the zero-extended offset, scaled by 4, to
the base address in register rs1′. It expands to
sw rs2′, offset(rs1′).

C.SD is an RV64C/RV128C-only instruction that stores a 64-bit value in
register rs2′ to memory. It computes an effective address by adding
the zero-extended offset, scaled by 8, to the base address in register
rs1′. It expands to sd rs2′, offset(rs1′).

C.SQ is an RV128C-only instruction that stores a 128-bit value in
register rs2′ to memory. It computes an effective address by adding
the zero-extended offset, scaled by 16, to the base address in
register rs1′. It expands to sq rs2′, offset(rs1′).

C.FSW is an RV32FC-only instruction that stores a single-precision
floating-point value in floating-point register rs2′ to memory. It
computes an effective address by adding the zero-extended offset,
scaled by 4, to the base address in register rs1′. It expands to
fsw rs2′, offset(rs1′).

C.FSD is an RV32DC/RV64DC-only instruction that stores a
double-precision floating-point value in floating-point register rs2′
to memory. It computes an effective address by adding the
zero-extended offset, scaled by 8, to the base address in register
rs1′. It expands to fsd rs2′, offset(rs1′).

Control Transfer Instructions

RVC provides unconditional jump instructions and conditional branch
instructions. As with base RVI instructions, the offsets of all RVC
control transfer instructions are in multiples of 2 bytes.


| 15:13 | 12:2 | 1:0 |
|:- |:- |:- |
| 3 | 11 | 2 |
| C.J | offset[11\|4\|9:8\|10\|6\|7\|3:1\|5] | C1 |
| C.JAL | offset[11\|4\|9:8\|10\|6\|7\|3:1\|5] | C1 |


These instructions use the CJ format.

C.J performs an unconditional control transfer. The offset is
sign-extended and added to the pc to form the jump target address. C.J
can therefore target a ±2 KiB range. C.J expands to jal x0, offset.

C.JAL is an RV32C-only instruction that performs the same operation as
C.J, but additionally writes the address of the instruction following
the jump (pc+2) to the link register, x1. C.JAL expands to
jal x1, offset.
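The bit permutation offset[11|4|9:8|10|6|7|3:1|5] in the CJ format can be hard to read from the table alone. The sketch below, with a hypothetical function name, unscrambles inst[12:2] into a signed byte offset; it assumes the fields are listed from inst[12] down to inst[2], as in the format table.

```python
def decode_cj_offset(inst: int) -> int:
    """Return the signed byte offset encoded in a C.J or C.JAL instruction."""
    def bit(n: int) -> int:
        return (inst >> n) & 1

    offset = (
        (bit(12) << 11)                 # inst[12]   -> offset[11]
        | (bit(11) << 4)                # inst[11]   -> offset[4]
        | (((inst >> 9) & 0x3) << 8)    # inst[10:9] -> offset[9:8]
        | (bit(8) << 10)                # inst[8]    -> offset[10]
        | (bit(7) << 6)                 # inst[7]    -> offset[6]
        | (bit(6) << 7)                 # inst[6]    -> offset[7]
        | (((inst >> 3) & 0x7) << 1)    # inst[5:3]  -> offset[3:1]
        | (bit(2) << 5)                 # inst[2]    -> offset[5]
    )
    if offset & 0x800:                  # sign-extend from offset[11]
        offset -= 0x1000
    return offset
```

The extreme values this function can return are -2048 and +2046, always in multiples of 2 bytes.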


| 15:12 | 11:7 | 6:2 | 1:0 |
|:- |:- |:- |:- |
| 4 | 5 | 5 | 2 |
| C.JR | src≠0 | 0 | C2 |
| C.JALR | src≠0 | 0 | C2 |


These instructions use the CR format.
C.JR (jump register) performs an unconditional control transfer to the
address in register rs1. C.JR expands to jalr x0, 0(rs1). C.JR is
only valid when rs1 ≠ x0; the code point with rs1 = x0 is
reserved.
C.JALR (jump and link register) performs the same operation as C.JR, but
additionally writes the address of the instruction following the jump
(pc+2) to the link register, x1. C.JALR expands to
jalr x1, 0(rs1). C.JALR is only valid when rs1 ≠ x0; the code
point with rs1 = x0 corresponds to the C.EBREAK instruction.

Strictly speaking, C.JALR does not expand exactly to a base RVI
instruction as the value added to the pc to form the link address is 2
rather than 4 as in the base ISA, but supporting both offsets of 2 and 4
bytes is only a very minor change to the base microarchitecture.


| 15:13 | 12:10 | 9:7 | 6:2 | 1:0 |
|:- |:- |:- |:- |:- |
| 3 | 3 | 3 | 5 | 2 |
| C.BEQZ | offset[8\|4:3] | src | offset[7:6\|2:1\|5] | C1 |
| C.BNEZ | offset[8\|4:3] | src | offset[7:6\|2:1\|5] | C1 |


These instructions use the CB format.

C.BEQZ performs conditional control transfers. The offset is
sign-extended and added to the pc to form the branch target address.
It can therefore target a ±256 B range. C.BEQZ takes the branch if the value
in register rs1′ is zero. It expands to beq rs1′, x0, offset.

C.BNEZ is defined analogously, but it takes the branch if rs1′
contains a nonzero value. It expands to bne rs1′, x0, offset.
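The CB branch offset is split across inst[12:10] and inst[6:2]. A minimal sketch of the reassembly follows; the function name is hypothetical and the field order is assumed to run from the most-significant encoded bit to the least, as in the format table.

```python
def decode_cb_offset(inst: int) -> int:
    """Return the signed byte offset of a C.BEQZ or C.BNEZ instruction."""
    def bit(n: int) -> int:
        return (inst >> n) & 1

    offset = (
        (bit(12) << 8)                  # inst[12]    -> offset[8]
        | (((inst >> 10) & 0x3) << 3)   # inst[11:10] -> offset[4:3]
        | (((inst >> 5) & 0x3) << 6)    # inst[6:5]   -> offset[7:6]
        | (((inst >> 3) & 0x3) << 1)    # inst[4:3]   -> offset[2:1]
        | (bit(2) << 5)                 # inst[2]     -> offset[5]
    )
    if offset & 0x100:                  # sign-extend from offset[8]
        offset -= 0x200
    return offset
```

The reachable offsets run from -256 to +254 bytes in steps of 2.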

Integer Computational Instructions

RVC provides several instructions for integer arithmetic and constant
generation.

Integer Constant-Generation Instructions

The two constant-generation instructions both use the CI instruction
format and can target any integer register.


| 15:13 | 12 | 11:7 | 6:2 | 1:0 |
|:- |:- |:- |:- |:- |
| 3 | 1 | 5 | 5 | 2 |
| C.LI | imm[5] | dest≠0 | imm[4:0] | C1 |
| C.LUI | nzimm[17] | dest≠{0,2} | nzimm[16:12] | C1 |


C.LI loads the sign-extended 6-bit immediate, imm, into register rd.
C.LI expands into addi rd, x0, imm. C.LI is only valid when rd≠x0;
the code points with rd=x0 encode HINTs.
C.LUI loads the non-zero 6-bit immediate field into bits 17–12 of the
destination register, clears the bottom 12 bits, and sign-extends bit 17
into all higher bits of the destination. C.LUI expands into
lui rd, nzimm. C.LUI is only valid when rd ≠ {x0,x2}, and when
the immediate is not equal to zero. The code points with nzimm=0 are
reserved; the remaining code points with rd=x0 are HINTs; and the
remaining code points with rd=x2 correspond to the C.ADDI16SP
instruction.
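To make the C.LUI immediate handling concrete, here is a small sketch that computes the value written to rd for a given encoding. The function name and the `xlen` parameter are hypothetical; the sketch checks only the immediate and does not reject the rd=x0 or rd=x2 code points.

```python
def c_lui_value(inst: int, xlen: int = 32) -> int:
    """Return the XLEN-bit value C.LUI writes to rd (immediate handling only)."""
    imm6 = (((inst >> 12) & 0x1) << 5) | ((inst >> 2) & 0x1F)
    if imm6 == 0:
        raise ValueError("nzimm=0 is a reserved code point")
    if imm6 & 0x20:                 # sign-extend from nzimm[17]
        imm6 -= 0x40
    # Place the immediate in bits 17:12; the low 12 bits are zero.
    return (imm6 << 12) & ((1 << xlen) - 1)
```

For example, an immediate field of all ones yields 0xFFFFF000 on RV32, since bit 17 is copied into every higher bit.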

Integer Register-Immediate Operations

These integer register-immediate operations are encoded in the CI format
and perform operations on an integer register and a 6-bit immediate.


| 15:13 | 12 | 11:7 | 6:2 | 1:0 |
|:- |:- |:- |:- |:- |
| 3 | 1 | 5 | 5 | 2 |
| C.ADDI | nzimm[5] | dest≠0 | nzimm[4:0] | C1 |
| C.ADDIW | imm[5] | dest≠0 | imm[4:0] | C1 |
| C.ADDI16SP | nzimm[9] | 2 | nzimm[4\|6\|8:7\|5] | C1 |


C.ADDI adds the non-zero sign-extended 6-bit immediate to the value in
register rd then writes the result to rd. C.ADDI expands into
addi rd, rd, nzimm. C.ADDI is only valid when rd≠x0 and
nzimm≠0. The code points with rd=x0 encode the C.NOP
instruction; the remaining code points with nzimm=0 encode HINTs.

C.ADDIW is an RV64C/RV128C-only instruction that performs the same
computation but produces a 32-bit result, then sign-extends the result
to 64 bits. C.ADDIW expands into addiw rd, rd, imm. The immediate can be
zero for C.ADDIW, where this corresponds to sext.w rd. C.ADDIW is
only valid when rd≠x0; the code points with rd=x0 are reserved.

C.ADDI16SP shares the opcode with C.LUI, but has a destination field of
x2. C.ADDI16SP adds the non-zero sign-extended 6-bit immediate to the
value in the stack pointer (sp=x2), where the immediate is scaled to
represent multiples of 16 in the range (-512,496). C.ADDI16SP is used to
adjust the stack pointer in procedure prologues and epilogues. It
expands into addi x2, x2, nzimm. C.ADDI16SP is only valid when
nzimm≠0; the code point with nzimm=0 is reserved.
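The scaled immediate of C.ADDI16SP can be recovered as in the following sketch. The function name is hypothetical, and the caller is assumed to have already established that the instruction is C.ADDI16SP (funct3 011, rd=x2, quadrant 1).

```python
def decode_c_addi16sp(inst: int) -> int:
    """Return the signed stack adjustment, in bytes, of a C.ADDI16SP."""
    def bit(n: int) -> int:
        return (inst >> n) & 1

    imm = (
        (bit(12) << 9)                  # inst[12]  -> nzimm[9]
        | (bit(6) << 4)                 # inst[6]   -> nzimm[4]
        | (bit(5) << 6)                 # inst[5]   -> nzimm[6]
        | (((inst >> 3) & 0x3) << 7)    # inst[4:3] -> nzimm[8:7]
        | (bit(2) << 5)                 # inst[2]   -> nzimm[5]
    )
    if imm == 0:
        raise ValueError("nzimm=0 is a reserved code point")
    if imm & 0x200:                     # sign-extend from nzimm[9]
        imm -= 0x400
    return imm
```

The result is always a multiple of 16 between -512 and 496 inclusive, matching the range stated above.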

In the standard RISC-V calling convention, the stack pointer sp is
always 16-byte aligned.


| 15:13 | 12:5 | 4:2 | 1:0 |
|:- |:- |:- |:- |
| 3 | 8 | 3 | 2 |
| C.ADDI4SPN | nzuimm[5:4\|9:6\|2\|3] | dest | C0 |


C.ADDI4SPN is a CIW-format instruction that adds a zero-extended
non-zero immediate, scaled by 4, to the stack pointer, x2, and writes
the result to rd′. This instruction is used to generate pointers to
stack-allocated variables, and expands to addi rd′, x2, nzuimm.
C.ADDI4SPN is only valid when nzuimm≠0; the code points with
nzuimm=0 are reserved.
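A sketch of the CIW immediate reassembly for C.ADDI4SPN follows. The function name is hypothetical, and rd′ is assumed to select registers x8 through x15 as elsewhere in RVC.

```python
def decode_c_addi4spn(inst: int):
    """Return (rd, nzuimm) for a C.ADDI4SPN encoding."""
    nzuimm = (
        (((inst >> 11) & 0x3) << 4)     # inst[12:11] -> nzuimm[5:4]
        | (((inst >> 7) & 0xF) << 6)    # inst[10:7]  -> nzuimm[9:6]
        | (((inst >> 6) & 0x1) << 2)    # inst[6]     -> nzuimm[2]
        | (((inst >> 5) & 0x1) << 3)    # inst[5]     -> nzuimm[3]
    )
    if nzuimm == 0:
        raise ValueError("nzuimm=0 is a reserved code point")
    rd = 8 + ((inst >> 2) & 0x7)        # rd' selects x8..x15 (assumed mapping)
    return rd, nzuimm
```

Note that the all-zero 16-bit pattern falls into the reserved nzuimm=0 case, which is consistent with it being the defined illegal instruction.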


| 15:13 | 12 | 11:7 | 6:2 | 1:0 |
|:- |:- |:- |:- |:- |
| 3 | 1 | 5 | 5 | 2 |
| C.SLLI | shamt[5] | dest≠0 | shamt[4:0] | C2 |


C.SLLI is a CI-format instruction that performs a logical left shift of
the value in register rd then writes the result to rd. The shift
amount is encoded in the shamt field. For RV128C, a shift amount of
zero is used to encode a shift of 64. C.SLLI expands into
slli rd, rd, shamt, except for RV128C with shamt=0, which expands to
slli rd, rd, 64.
For RV32C, shamt[5] must be zero; the code points with
shamt[5]=1 are designated for custom extensions. For RV32C and
RV64C, the shift amount must be non-zero; the code points with shamt=0
are HINTs. For all base ISAs, the code points with rd=x0 are HINTs,
except those with shamt[5]=1 in RV32C.


| 15:13 | 12 | 11:10 | 9:7 | 6:2 | 1:0 |
|:- |:- |:- |:- |:- |:- |
| 3 | 1 | 2 | 3 | 5 | 2 |
| C.SRLI | shamt[5] | C.SRLI | dest | shamt[4:0] | C1 |
| C.SRAI | shamt[5] | C.SRAI | dest | shamt[4:0] | C1 |


C.SRLI is a CB-format instruction that performs a logical right shift of
the value in register rd′ then writes the result to rd′. The shift
amount is encoded in the shamt field. For RV128C, a shift amount of
zero is used to encode a shift of 64. Furthermore, the shift amount is
sign-extended for RV128C, and so the legal shift amounts are 1–31, 64,
and 96–127. C.SRLI expands into srli rd′, rd′, shamt, except for
RV128C with shamt=0, which expands to srli rd′, rd′, 64.

For RV32C, shamt[5] must be zero; the code points with
shamt[5]=1 are designated for custom extensions. For RV32C and
RV64C, the shift amount must be non-zero; the code points with shamt=0
are HINTs.

C.SRAI is defined analogously to C.SRLI, but instead performs an
arithmetic right shift. C.SRAI expands to srai rd′, rd′, shamt.
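The RV128C rule for right shifts (zero encodes 64, and the 6-bit field is sign-extended) can be checked with a short sketch. The function name is hypothetical; it maps the raw 6-bit shamt field to the effective shift amount.

```python
def rv128c_right_shift_amount(shamt6: int) -> int:
    """Map the 6-bit shamt field of C.SRLI/C.SRAI to the RV128C shift amount."""
    if not 0 <= shamt6 < 64:
        raise ValueError("shamt is a 6-bit field")
    if shamt6 == 0:
        return 64                   # zero encodes a shift of 64
    if shamt6 & 0x20:
        return shamt6 + 64          # sign-extend to 7 bits: 32..63 -> 96..127
    return shamt6                   # 1..31 are used as-is


legal = sorted({rv128c_right_shift_amount(s) for s in range(64)})
```

Enumerating all 64 field values this way yields exactly the amounts 1–31, 64, and 96–127 listed in the text.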

Left shifts are usually more frequent than right shifts, as left shifts
are frequently used to scale address values. Right shifts have therefore
been granted less encoding space and are placed in an encoding quadrant
where all other immediates are sign-extended. For RV128, the decision
was made to have the 6-bit shift-amount immediate also be sign-extended.
Apart from reducing the decode complexity, we believe right-shift
amounts of 96–127 will be more useful than 64–95, to allow extraction of
tags located in the high portions of 128-bit address pointers. We note
that RV128C will not be frozen at the same point as RV32C and RV64C, to
allow evaluation of typical usage of 128-bit address-space codes.


| 15:13 | 12 | 11:10 | 9:7 | 6:2 | 1:0 |
|:- |:- |:- |:- |:- |:- |
| 3 | 1 | 2 | 3 | 5 | 2 |
| C.ANDI | imm[5] | C.ANDI | dest | imm[4:0] | C1 |


C.ANDI is a CB-format instruction that computes the bitwise AND of the
value in register rd′ and the sign-extended 6-bit immediate, then
writes the result to rd′. C.ANDI expands to andi rd′, rd′, imm.

Integer Register-Register Operations


| 15:12 | 11:7 | 6:2 | 1:0 |
|:- |:- |:- |:- |
| 4 | 5 | 5 | 2 |
| C.MV | dest≠0 | src≠0 | C2 |
| C.ADD | dest≠0 | src≠0 | C2 |


These instructions use the CR format.
C.MV copies the value in register rs2 into register rd. C.MV expands
into add rd, x0, rs2. C.MV is only valid when rs2 ≠ x0; the code
points with rs2 = x0 correspond to the C.JR instruction. The code
points with rs2 ≠ x0 and rd = x0 are HINTs.

C.MV expands to a different instruction than the canonical MV
pseudoinstruction, which instead uses ADDI. Implementations that handle
MV specially, e.g. using register-renaming hardware, may find it more
convenient to expand C.MV to MV instead of ADD, at slight additional
hardware cost.

C.ADD adds the values in registers rd and rs2 and writes the result
to register rd. C.ADD expands into add rd, rd, rs2. C.ADD is only
valid when rs2 ≠ x0; the code points with rs2 = x0 correspond to
the C.JALR and C.EBREAK instructions. The code points with rs2 ≠ x0
and rd = x0 are HINTs.
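Since C.JR, C.MV, C.JALR, C.EBREAK, and C.ADD all share the same CR-format opcode, a decoder has to tell them apart by inst[12] and by whether the two register fields are zero. The following sketch, with a hypothetical function name, illustrates one way to write that case analysis.

```python
def classify_cr(inst: int) -> str:
    """Classify a quadrant-2 encoding with inst[15:13] = 100."""
    if inst & 0x3 != 0b10 or (inst >> 13) & 0x7 != 0b100:
        raise ValueError("not a CR-format encoding in quadrant 2")
    bit12 = (inst >> 12) & 0x1
    rd_rs1 = (inst >> 7) & 0x1F
    rs2 = (inst >> 2) & 0x1F
    if bit12 == 0:
        if rs2 == 0:
            return "C.JR" if rd_rs1 != 0 else "reserved"
        return "C.MV" if rd_rs1 != 0 else "HINT"
    if rs2 == 0:
        return "C.JALR" if rd_rs1 != 0 else "C.EBREAK"
    return "C.ADD" if rd_rs1 != 0 else "HINT"
```

For instance, the common return sequence jalr x0, 0(x1) compresses to a C.JR with rs1=x1, and the same encoding with inst[12] set becomes C.JALR.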


| 15:10 | 9:7 | 6:5 | 4:2 | 1:0 |
|:- |:- |:- |:- |:- |
| 6 | 3 | 2 | 3 | 2 |
| C.AND | dest | C.AND | src | C1 |
| C.OR | dest | C.OR | src | C1 |
| C.XOR | dest | C.XOR | src | C1 |
| C.SUB | dest | C.SUB | src | C1 |
| C.ADDW | dest | C.ADDW | src | C1 |
| C.SUBW | dest | C.SUBW | src | C1 |


These instructions use the CA format.

C.AND computes the bitwise AND of the values in registers rd′ and
rs2′, then writes the result to register rd′. C.AND expands into
and rd′, rd′, rs2′.

C.OR computes the bitwise OR of the values in registers rd′ and
rs2′, then writes the result to register rd′. C.OR expands into
or rd′, rd′, rs2′.

C.XOR computes the bitwise XOR of the values in registers rd′ and
rs2′, then writes the result to register rd′. C.XOR expands into
xor rd′, rd′, rs2′.

C.SUB subtracts the value in register rs2′ from the value in register
rd′, then writes the result to register rd′. C.SUB expands into
sub rd′, rd′, rs2′.

C.ADDW is an RV64C/RV128C-only instruction that adds the values in
registers rd′ and rs2′, then sign-extends the lower 32 bits of the
sum before writing the result to register rd′. C.ADDW expands into
addw rd′, rd′, rs2′.

C.SUBW is an RV64C/RV128C-only instruction that subtracts the value in
register rs2′ from the value in register rd′, then sign-extends
the lower 32 bits of the difference before writing the result to
register rd′. C.SUBW expands into subw rd′, rd′, rs2′.

This group of six instructions does not provide large savings
individually, but does not occupy much encoding space and is
straightforward to implement, and as a group provides a worthwhile
improvement in static and dynamic compression.

Defined Illegal Instruction


| 15:13 | 12 | 11:7 | 6:2 | 1:0 |
|:- |:- |:- |:- |:- |
| 3 | 1 | 5 | 5 | 2 |
| 0 | 0 | 0 | 0 | 0 |


A 16-bit instruction with all bits zero is permanently reserved as an
illegal instruction.

We reserve all-zero instructions to be illegal instructions to help trap
attempts to execute zeroed or non-existent portions of the memory
space. The all-zero value should not be redefined in any non-standard
extension. Similarly, we reserve instructions with all bits set to 1
(corresponding to very long instructions in the RISC-V variable-length
encoding scheme) as illegal to capture another common value seen in
non-existent memory regions.

NOP Instruction


| 15:13 | 12 | 11:7 | 6:2 | 1:0 |
|:- |:- |:- |:- |:- |
| 3 | 1 | 5 | 5 | 2 |
| C.NOP | 0 | 0 | 0 | C1 |


C.NOP is a CI-format instruction that does not change any user-visible
state, except for advancing the pc and incrementing any applicable
performance counters. C.NOP expands to nop. C.NOP is only valid when
imm=0; the code points with imm≠0 encode HINTs.

Breakpoint Instruction


| 15:12 | 11:2 | 1:0 |
|:- |:- |:- |
| 4 | 10 | 2 |
| C.EBREAK | 0 | C2 |


Debuggers can use the C.EBREAK instruction, which expands to ebreak,
to cause control to be transferred back to the debugging environment.
C.EBREAK shares the opcode with the C.ADD instruction, but with rd and
rs2 both zero, and can therefore also use the CR format.

Usage of C Instructions in LR/SC Sequences

On implementations that support the C extension, compressed forms of the
I instructions permitted inside constrained LR/SC sequences, as
described in Section [sec:lrscseq], are also permitted
inside constrained LR/SC sequences.

The implication is that any implementation that claims to support both
the A and C extensions must ensure that LR/SC sequences containing valid
C instructions will eventually complete.

HINT Instructions

A portion of the RVC encoding space is reserved for microarchitectural
HINTs. Like the HINTs in the RV32I base ISA (see
Section [sec:rv32i-hints]), these
instructions do not modify any architectural state, except for advancing
the pc and any applicable performance counters. HINTs are executed as
no-ops on implementations that ignore them.
RVC HINTs are encoded as computational instructions that do not modify
the architectural state, either because rd=x0 (e.g.
C.ADD x0, t0), or because rd is overwritten with a copy of itself
(e.g. C.ADDI t0, 0).

This HINT encoding has been chosen so that simple implementations can
ignore HINTs altogether, and instead execute a HINT as a regular
computational instruction that happens not to mutate the architectural
state.

RVC HINTs do not necessarily expand to their RVI HINT counterparts. For
example, C.ADD x0, a0 might not encode the same HINT as
ADD x0, x0, a0.

The primary reason to not require an RVC HINT to expand to an RVI HINT
is that HINTs are unlikely to be compressible in the same manner as the
underlying computational instruction. Also, decoupling the RVC and RVI
HINT mappings allows the scarce RVC HINT space to be allocated to the
most popular HINTs, and in particular, to HINTs that are amenable to
macro-op fusion.

Table 1.3 lists all RVC HINT code points.
For RV32C, 78% of the HINT space is reserved for standard HINTs. The
remainder of the HINT space is designated for custom HINTs: no standard
HINTs will ever be defined in this subspace.


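| Instruction | Constraints | Code Points | Purpose |
|:- |:- |:- |:- |
| C.NOP | nzimm≠0 | 63 | Reserved for future standard use |
| C.ADDI | rd≠x0, nzimm=0 | 31 | Reserved for future standard use |
| C.LI | rd=x0 | 64 | Reserved for future standard use |
| C.LUI | rd=x0, nzimm≠0 | 63 | Reserved for future standard use |
| C.MV | rd=x0, rs2≠x0 | 31 | Reserved for future standard use |
| C.ADD | rd=x0, rs2≠x0, rs2≠x2–x5 | 27 | Reserved for future standard use |
| C.ADD | rd=x0, rs2=x2–x5 | 4 | (rs2=x2) C.NTL.P1; (rs2=x3) C.NTL.PALL; (rs2=x4) C.NTL.S1; (rs2=x5) C.NTL.ALL |
| C.SLLI | rd=x0, nzimm≠0 | 31 (RV32), 63 (RV64/128) | Designated for custom use |
| C.SLLI64 | rd=x0 | 1 | Designated for custom use |
| C.SLLI64 | rd≠x0, RV32 and RV64 only | 31 | Designated for custom use |
| C.SRLI64 | RV32 and RV64 only | 8 | Designated for custom use |
| C.SRAI64 | RV32 and RV64 only | 8 | Designated for custom use |

RVC HINT instructions.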
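The 78% figure quoted for RV32C can be re-derived from the code-point counts in the table. The short calculation below is only a consistency check; the dictionary names are hypothetical, and the split into standard and custom rows follows the Purpose column.

```python
# Code points reserved for standard HINTs (including the four C.NTL.* HINTs).
standard = {
    "C.NOP": 63,
    "C.ADDI": 31,
    "C.LI": 64,
    "C.LUI": 63,
    "C.MV": 31,
    "C.ADD, rs2 not in x2-x5": 27,
    "C.ADD, rs2 in x2-x5": 4,
}

# Code points designated for custom HINTs, using the RV32 counts.
custom_rv32 = {
    "C.SLLI, rd=x0": 31,
    "C.SLLI64, rd=x0": 1,
    "C.SLLI64, rd!=x0": 31,
    "C.SRLI64": 8,
    "C.SRAI64": 8,
}

standard_total = sum(standard.values())
custom_total = sum(custom_rv32.values())
standard_share = standard_total / (standard_total + custom_total)
print(f"{standard_total} of {standard_total + custom_total} code points "
      f"({standard_share:.0%}) are reserved for standard HINTs")
```

With these counts, 283 of 362 RV32C HINT code points are standard, which rounds to 78%.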

RVC Instruction Set Listings

Table [rvcopcodemap] shows a map of the
major opcodes for RVC. Each row of the table corresponds to one quadrant
of the encoding space. The last quadrant, which has the two
least-significant bits set, corresponds to instructions wider than 16
bits, including those in the base ISAs. Several instructions are only
valid for certain operands; when invalid, they are marked either RES
to indicate that the opcode is reserved for future standard extensions;
Custom to indicate that the opcode is designated for custom
extensions; or HINT to indicate that the opcode is reserved for
microarchitectural hints (see
Section 1.7).
Tables [rvc-instr-table0]–[rvc-instr-table2] list the RVC
instructions.
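As a small illustration of the quadrant structure described above, the sketch below inspects the two least-significant bits of a 16-bit parcel. The function names are hypothetical.

```python
from typing import Optional


def rvc_quadrant(parcel: int) -> Optional[int]:
    """Return the RVC quadrant (0, 1, or 2) of a 16-bit parcel.

    Returns None when inst[1:0] = 11, i.e. when the parcel begins an
    instruction wider than 16 bits.
    """
    quadrant = parcel & 0x3
    return None if quadrant == 0b11 else quadrant


def is_defined_illegal(parcel: int) -> bool:
    """True for the all-zero 16-bit pattern reserved as illegal."""
    return (parcel & 0xFFFF) == 0
```

A fetch unit would typically perform this check first, before selecting which of the three quadrant listings to decode against.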


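| inst[1:0] | Base ISA | inst[15:13]=000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
|:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |
| 00 | RV32 | ADDI4SPN | FLD | LW | FLW | Reserved | FSD | SW | FSW |
| 00 | RV64 | ADDI4SPN | FLD | LW | LD | Reserved | FSD | SW | SD |
| 00 | RV128 | ADDI4SPN | LQ | LW | LD | Reserved | SQ | SW | SD |
| 01 | RV32 | ADDI | JAL | LI | LUI/ADDI16SP | MISC-ALU | J | BEQZ | BNEZ |
| 01 | RV64 | ADDI | ADDIW | LI | LUI/ADDI16SP | MISC-ALU | J | BEQZ | BNEZ |
| 01 | RV128 | ADDI | ADDIW | LI | LUI/ADDI16SP | MISC-ALU | J | BEQZ | BNEZ |
| 10 | RV32 | SLLI | FLDSP | LWSP | FLWSP | J[AL]R/MV/ADD | FSDSP | SWSP | FSWSP |
| 10 | RV64 | SLLI | FLDSP | LWSP | LDSP | J[AL]R/MV/ADD | FSDSP | SWSP | SDSP |
| 10 | RV128 | SLLI | LQSP | LWSP | LDSP | J[AL]R/MV/ADD | SQSP | SWSP | SDSP |
| 11 | all | >16b | >16b | >16b | >16b | >16b | >16b | >16b | >16b |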


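| inst[15:13] | inst[12:2] fields, most-significant first | inst[1:0] | Instruction |
|:- |:- |:- |:- |
| 000 | 0, 0 | 00 | Illegal instruction |
| 000 | nzuimm[5:4\|9:6\|2\|3], rd′ | 00 | C.ADDI4SPN (RES, nzuimm=0) |
| 001 | uimm[5:3], rs1′, uimm[7:6], rd′ | 00 | C.FLD (RV32/64) |
| 001 | uimm[5:4\|8], rs1′, uimm[7:6], rd′ | 00 | C.LQ (RV128) |
| 010 | uimm[5:3], rs1′, uimm[2\|6], rd′ | 00 | C.LW |
| 011 | uimm[5:3], rs1′, uimm[2\|6], rd′ | 00 | C.FLW (RV32) |
| 011 | uimm[5:3], rs1′, uimm[7:6], rd′ | 00 | C.LD (RV64/128) |
| 100 | — | 00 | Reserved |
| 101 | uimm[5:3], rs1′, uimm[7:6], rs2′ | 00 | C.FSD (RV32/64) |
| 101 | uimm[5:4\|8], rs1′, uimm[7:6], rs2′ | 00 | C.SQ (RV128) |
| 110 | uimm[5:3], rs1′, uimm[2\|6], rs2′ | 00 | C.SW |
| 111 | uimm[5:3], rs1′, uimm[2\|6], rs2′ | 00 | C.FSW (RV32) |
| 111 | uimm[5:3], rs1′, uimm[7:6], rs2′ | 00 | C.SD (RV64/128) |

Instruction listing for RVC, Quadrant 0.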


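| inst[15:13] | inst[12:2] fields, most-significant first | inst[1:0] | Instruction |
|:- |:- |:- |:- |
| 000 | nzimm[5], 0, nzimm[4:0] | 01 | C.NOP (HINT, nzimm≠0) |
| 000 | nzimm[5], rs1/rd≠0, nzimm[4:0] | 01 | C.ADDI (HINT, nzimm=0) |
| 001 | imm[11\|4\|9:8\|10\|6\|7\|3:1\|5] | 01 | C.JAL (RV32) |
| 001 | imm[5], rs1/rd≠0, imm[4:0] | 01 | C.ADDIW (RV64/128; RES, rd=0) |
| 010 | imm[5], rd≠0, imm[4:0] | 01 | C.LI (HINT, rd=0) |
| 011 | nzimm[9], 2, nzimm[4\|6\|8:7\|5] | 01 | C.ADDI16SP (RES, nzimm=0) |
| 011 | nzimm[17], rd≠{0, 2}, nzimm[16:12] | 01 | C.LUI (RES, nzimm=0; HINT, rd=0) |
| 100 | nzuimm[5], 00, rs1′/rd′, nzuimm[4:0] | 01 | C.SRLI (RV32 Custom, nzuimm[5]=1) |
| 100 | 0, 00, rs1′/rd′, 0 | 01 | C.SRLI64 (RV128; RV32/64 HINT) |
| 100 | nzuimm[5], 01, rs1′/rd′, nzuimm[4:0] | 01 | C.SRAI (RV32 Custom, nzuimm[5]=1) |
| 100 | 0, 01, rs1′/rd′, 0 | 01 | C.SRAI64 (RV128; RV32/64 HINT) |
| 100 | imm[5], 10, rs1′/rd′, imm[4:0] | 01 | C.ANDI |
| 100 | 0, 11, rs1′/rd′, 00, rs2′ | 01 | C.SUB |
| 100 | 0, 11, rs1′/rd′, 01, rs2′ | 01 | C.XOR |
| 100 | 0, 11, rs1′/rd′, 10, rs2′ | 01 | C.OR |
| 100 | 0, 11, rs1′/rd′, 11, rs2′ | 01 | C.AND |
| 100 | 1, 11, rs1′/rd′, 00, rs2′ | 01 | C.SUBW (RV64/128; RV32 RES) |
| 100 | 1, 11, rs1′/rd′, 01, rs2′ | 01 | C.ADDW (RV64/128; RV32 RES) |
| 100 | 1, 11, —, 10, — | 01 | Reserved |
| 100 | 1, 11, —, 11, — | 01 | Reserved |
| 101 | imm[11\|4\|9:8\|10\|6\|7\|3:1\|5] | 01 | C.J |
| 110 | imm[8\|4:3], rs1′, imm[7:6\|2:1\|5] | 01 | C.BEQZ |
| 111 | imm[8\|4:3], rs1′, imm[7:6\|2:1\|5] | 01 | C.BNEZ |

Instruction listing for RVC, Quadrant 1.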


| inst[15:13] | inst[12:2] fields, most-significant first | inst[1:0] | Instruction |
|:- |:- |:- |:- |
| 000 | nzuimm[5], rs1/rd≠0, nzuimm[4:0] | 10 | C.SLLI (HINT, rd=0; RV32 Custom, nzuimm[5]=1) |
| 000 | 0, rs1/rd≠0, 0 | 10 | C.SLLI64 (RV128; RV32/64 HINT; HINT, rd=0) |
| 001 | uimm[5], rd, uimm[4:3\|8:6] | 10 | C.FLDSP (RV32/64) |
| 001 | uimm[5], rd≠0, uimm[4\|9:6] | 10 | C.LQSP (RV128; RES, rd=0) |
| 010 | uimm[5], rd≠0, uimm[4:2\|7:6] | 10 | C.LWSP (RES, rd=0) |
| 011 | uimm[5], rd, uimm[4:2\|7:6] | 10 | C.FLWSP (RV32) |
| 011 | uimm[5], rd≠0, uimm[4:3\|8:6] | 10 | C.LDSP (RV64/128; RES, rd=0) |
| 100 | 0, rs1≠0, 0 | 10 | C.JR (RES, rs1=0) |
| 100 | 0, rd≠0, rs2≠0 | 10 | C.MV (HINT, rd=0) |
| 100 | 1, 0, 0 | 10 | C.EBREAK |
| 100 | 1, rs1≠0, 0 | 10 | C.JALR |
| 100 | 1, rs1/rd≠0, rs2≠0 | 10 | C.ADD (HINT, rd=0) |
| 101 | uimm[5:3\|8:6], rs2 | 10 | C.FSDSP (RV32/64) |
| 101 | uimm[5:4\|9:6], rs2 | 10 | C.SQSP (RV128) |
| 110 | uimm[5:2\|7:6], rs2 | 10 | C.SWSP |
| 111 | uimm[5:2\|7:6], rs2 | 10 | C.FSWSP (RV32) |
| 111 | uimm[5:3\|8:6], rs2 | 10 | C.SDSP (RV64/128) |

Instruction listing for RVC, Quadrant 2.


# “B” Standard Extension for Bit Manipulation, Version 0.0
This chapter is a placeholder for a future standard extension to provide
bit manipulation instructions, including instructions to insert,
extract, and test bit fields, and for rotations, funnel shifts, and bit
and byte permutations.

Although bit manipulation instructions are very effective in some
application domains, particularly when dealing with externally packed
data structures, we excluded them from the base ISAs as they are not
useful in all domains and can add additional complexity or instruction
formats to supply all needed operands.
We anticipate the B extension will be a brownfield encoding within the
base 30-bit instruction space.

# “J” Standard Extension for Dynamically Translated Languages, Version 0.0
This chapter is a placeholder for a future standard extension to support
dynamically translated languages.

Many popular languages are usually implemented via dynamic translation,
including Java and JavaScript. These languages can benefit from
additional ISA support for dynamic checks and garbage collection.

# “P” Standard Extension for Packed-SIMD Instructions, Version 0.2

Discussions at the 5th RISC-V workshop indicated a desire to drop this
packed-SIMD proposal for floating-point registers in favor of
standardizing on the V extension for large floating-point SIMD
operations. However, there was interest in packed-SIMD fixed-point
operations for use in the integer registers of small RISC-V
implementations. A task group is working to define the new P extension.

# “V” Standard Extension for Vector Operations, Version 0.7
The current working group draft is hosted at
 https://github.com/riscv/riscv-v-spec.

The base vector extension is intended to provide general support for
data-parallel execution within the 32-bit instruction encoding space,
with later vector extensions supporting richer functionality for certain
domains.

# “Zam” Standard Extension for Misaligned Atomics, v0.1
This chapter defines the “Zam” extension, which extends the “A”
extension by standardizing support for misaligned atomic memory
operations (AMOs). On platforms implementing “Zam”, misaligned AMOs need
only execute atomically with respect to other accesses (including
non-atomic loads and stores) to the same address and of the same size.
More precisely, execution environments implementing “Zam” are subject to
the following axiom:

Atomicity Axiom for misaligned atomics

If r and w are paired misaligned load and store instructions from a
hart h with the same address and of the same size, then there can be
no store instruction s from a hart other than h with the same
address and of the same size as r and w such that a store operation
generated by s lies in between memory operations generated by r and
w in the global memory order. Furthermore, there can be no load
instruction l from a hart other than h with the same address and of
the same size as r and w such that a load operation generated by l
lies between two memory operations generated by r or by w in the
global memory order.

This restricted form of atomicity is intended to balance the needs of
applications which require support for misaligned atomics and the
ability of the implementation to actually provide the necessary degree
of atomicity.

Aligned instructions under “Zam” continue to behave as they normally do
under RVWMO.

The intention of “Zam” is that it can be implemented in one of two ways:


On hardware that natively supports atomic misaligned accesses to the
address and size in question (e.g., for misaligned accesses within a
single cache line): by simply following the same rules that would be
applied for aligned AMOs.


On hardware that does not natively support misaligned accesses to
the address and size in question: by trapping on all instructions
(including loads) with that address and size and executing them (via
any number of memory operations) inside a mutex that is a function
of the given memory address and access size. AMOs may be emulated by
splitting them into separate load and store operations, but all
preserved program order rules (e.g., incoming and outgoing syntactic
dependencies) must behave as if the AMO is still a single memory
operation.
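The second implementation strategy can be pictured with a user-level analogy. The sketch below is not a trap handler and does not model the preserved program order rules; it only shows the idea of serializing every emulated access through a mutex chosen as a function of the address and access size. All names are hypothetical, and memory is modeled as a Python bytearray.

```python
import threading

_table_guard = threading.Lock()
_locks = {}


def _lock_for(addr: int, size: int) -> threading.Lock:
    """Return the mutex associated with this (address, size) pair."""
    with _table_guard:
        return _locks.setdefault((addr, size), threading.Lock())


def emulated_amoadd(memory: bytearray, addr: int, size: int, operand: int) -> int:
    """Emulate a misaligned AMOADD as a separate load and store under a mutex."""
    with _lock_for(addr, size):
        old = int.from_bytes(memory[addr:addr + size], "little")
        new = (old + operand) % (1 << (8 * size))
        memory[addr:addr + size] = new.to_bytes(size, "little")
        return old


def emulated_load(memory: bytearray, addr: int, size: int) -> int:
    """Loads with the same address and size take the same mutex."""
    with _lock_for(addr, size):
        return int.from_bytes(memory[addr:addr + size], "little")
```

Because loads also take the mutex, no load with the same address and size can observe the memory between the load half and the store half of an emulated AMO, which is the property the axiom asks for.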


# “Zfinx”, “Zdinx”, “Zhinx”, “Zhinxmin”: Standard Extensions for Floating-Point in Integer Registers, Version 1.0
This chapter defines the “Zfinx” extension (pronounced “z-f-in-x”) that
provides instructions similar to those in the standard floating-point F
extension for single-precision floating-point instructions but which
operate on the x registers instead of the f registers. This chapter
also defines the “Zdinx”, “Zhinx”, and “Zhinxmin” extensions that
provide similar instructions for other floating-point precisions.

The F extension uses separate f registers for floating-point
computation, to reduce register pressure and simplify the provision of
register-file ports for wide superscalars. However, the addition of
architectural state increases the minimal implementation cost. By
eliminating the f registers, the Zfinx extension substantially reduces
the cost of simple RISC-V implementations with floating-point
instruction-set support. Zfinx also reduces context-switch cost.

In general, software that assumes the presence of the F extension is
incompatible with software that assumes the presence of the Zfinx
extension, and vice versa.

The Zfinx extension adds all of the instructions that the F extension
adds, except for the transfer instructions FLW, FSW, FMV.W.X, FMV.X.W,
C.FLW[SP], and C.FSW[SP].

Zfinx software uses integer loads and stores to transfer floating-point
values from and to memory. Transfers between registers use either
integer arithmetic or floating-point sign-injection instructions.

The Zfinx variants of these F-extension instructions have the same
semantics, except that whenever such an instruction would have accessed
an f register, it instead accesses the x register with the same
number.

Processing of Narrower Values

Floating-point operands of width w < XLEN bits occupy bits w-1:0 of
an x register. Floating-point operations on w-bit operands ignore
operand bits XLEN-1:w.
Floating-point operations that produce w < XLEN-bit results fill bits
XLEN-1:w with copies of bit w-1 (the sign bit).
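The rule for narrower values can be expressed as two small helpers: one that writes a w-bit result into an XLEN-bit register with sign-bit fill, and one that reads an operand while ignoring the upper bits. The function names are hypothetical and registers are modeled as non-negative Python integers.

```python
def zfinx_write(value_bits: int, w: int, xlen: int) -> int:
    """Place a w-bit floating-point result in an XLEN-bit x register."""
    value_bits &= (1 << w) - 1
    if (value_bits >> (w - 1)) & 1:
        # Fill bits XLEN-1:w with copies of bit w-1 (the sign bit).
        value_bits |= ((1 << (xlen - w)) - 1) << w
    return value_bits


def zfinx_read(reg: int, w: int) -> int:
    """Read a w-bit floating-point operand, ignoring bits XLEN-1:w."""
    return reg & ((1 << w) - 1)
```

For example, the single-precision pattern for -1.0 (0xBF800000) is held in an RV64 x register as 0xFFFFFFFFBF800000, whereas +1.0 (0x3F800000) is held with zero upper bits.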

The NaN-boxing scheme employed in the f registers was designed to
efficiently support recoded floating-point formats. Recoding is less
practical for Zfinx, though, since the same registers hold both
floating-point and integer operands. Hence, the need for NaN boxing is
diminished.
Sign-extending 32-bit floating-point numbers when held in RV64 x
registers matches the existing RV64 calling conventions, which require
all 32-bit types to be sign-extended when passed or returned in x
registers. To keep the architecture more regular, we extend this pattern
to 16-bit floating-point numbers in both RV32 and RV64.

Zdinx

The Zdinx extension provides analogous double-precision floating-point
instructions. The Zdinx extension requires the Zfinx extension.
The Zdinx extension adds all of the instructions that the D extension
adds, except for the transfer instructions FLD, FSD, FMV.D.X, FMV.X.D,
C.FLD[SP], and C.FSD[SP].
The Zdinx variants of these D-extension instructions have the same
semantics, except that whenever such an instruction would have accessed
an f register, it instead accesses the x register with the same
number.

Processing of Wider Values

Double-precision operands in RV32Zdinx are held in aligned x-register
pairs, i.e., register numbers must be even. Use of misaligned
(odd-numbered) registers for double-width floating-point operands is
reserved.
Regardless of endianness, the lower-numbered register holds the
low-order bits, and the higher-numbered register holds the high-order
bits: e.g., bits 31:0 of a double-precision operand in RV32Zdinx might
be held in register x14, with bits 63:32 of that operand held in
x15.
When a double-width floating-point result is written to x0, the entire
write takes no effect: e.g., for RV32Zdinx, writing a double-precision
result to x0 does not cause x1 to be written.
When x0 is used as a double-width floating-point operand, the entire
operand is zero—i.e., x1 is not accessed.
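The register-pair rules for RV32Zdinx, including the special handling of x0, can be sketched as follows. The helper names are hypothetical, and the register file is modeled as a list of 32 integers holding 32-bit values.

```python
def read_double(regs, n: int) -> int:
    """Read a 64-bit operand from the aligned pair (x[n], x[n+1]) on RV32."""
    if n % 2:
        raise ValueError("odd-numbered register pairs are reserved")
    if n == 0:
        return 0                        # x0 as operand: all 64 bits are zero, x1 untouched
    return regs[n] | (regs[n + 1] << 32)


def write_double(regs, n: int, value: int) -> None:
    """Write a 64-bit result to the aligned pair (x[n], x[n+1]) on RV32."""
    if n % 2:
        raise ValueError("odd-numbered register pairs are reserved")
    if n == 0:
        return                          # write to x0: neither x0 nor x1 changes
    regs[n] = value & 0xFFFFFFFF        # low-order bits in the lower-numbered register
    regs[n + 1] = (value >> 32) & 0xFFFFFFFF
```

Writing the double-precision pattern for 1.0 (0x3FF0000000000000) to the pair starting at x14 leaves x14 equal to 0 and x15 equal to 0x3FF00000.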

Load-pair and store-pair instructions are not provided, so transferring
double-precision operands in RV32Zdinx from or to memory requires two
loads or stores. Register moves need only a single FSGNJ.D instruction,
however.

Zhinx

The Zhinx extension provides analogous half-precision floating-point
instructions. The Zhinx extension requires the Zfinx extension.
The Zhinx extension adds all of the instructions that the Zfh extension
adds, except for the transfer instructions FLH, FSH, FMV.H.X, and
FMV.X.H.
The Zhinx variants of these Zfh-extension instructions have the same
semantics, except that whenever such an instruction would have accessed
an f register, it instead accesses the x register with the same
number.

Zhinxmin

The Zhinxmin extension provides minimal support for 16-bit
half-precision floating-point instructions that operate on the x
registers. The Zhinxmin extension requires the Zfinx extension.
The Zhinxmin extension includes the following instructions from the
Zhinx extension: FCVT.S.H and FCVT.H.S. If the Zdinx extension is
present, the FCVT.D.H and FCVT.H.D instructions are also included.

In the future, an RV64Zqinx quad-precision extension could be defined
analogously to RV32Zdinx. An RV32Zqinx extension could also be defined
but would require quad-register groups.

Privileged Architecture Implications

In the standard privileged architecture defined in Volume II, the
mstatus field FS is hardwired to 0 if the Zfinx extension is
implemented, and FS no longer affects the trapping behavior of
floating-point instructions or fcsr accesses.
The misa bits F, D, and Q are hardwired to 0 when the Zfinx extension
is implemented.

A future discoverability mechanism might be used to probe the existence
of the Zfinx, Zhinx, and Zdinx extensions.

# “Zfa” Standard Extension for Additional Floating-Point Instructions, Version 0.1
Warning! This draft specification may change before being accepted as
standard by RISC-V International.
This chapter describes the Zfa standard extension, which adds
instructions for immediate loads, IEEE 754-2019 minimum and maximum
operations, round-to-integer operations, and quiet floating-point
comparisons. For RV32D, the Zfa extension also adds instructions to
transfer double-precision floating-point values to and from integer
registers, and for RV64Q, it adds analogous instructions for
quad-precision floating-point values. The Zfa extension depends on the F
extension.
Load-Immediate Instructions

The FLI.S instruction loads one of 32 single-precision floating-point
constants, encoded in the rs1 field, into floating-point register
rd. The correspondence of rs1 field values and single-precision
floating-point values is shown in
Table 1.1. FLI.S is encoded like FMV.W.X, but
with rs2=1.


| rs1 | Value | Sign | Exponent | Significand |
|---|---|---|---|---|
| 0 | −1.0 | 1 | 01111111 | 000...000 |
| 1 | Minimum positive normal | 0 | 00000001 | 000...000 |
| 2 | 1.0 × 2⁻¹⁶ | 0 | 01101111 | 000...000 |
| 3 | 1.0 × 2⁻¹⁵ | 0 | 01110000 | 000...000 |
| 4 | 1.0 × 2⁻⁸ | 0 | 01110111 | 000...000 |
| 5 | 1.0 × 2⁻⁷ | 0 | 01111000 | 000...000 |
| 6 | 0.0625 (2⁻⁴) | 0 | 01111011 | 000...000 |
| 7 | 0.125 (2⁻³) | 0 | 01111100 | 000...000 |
| 8 | 0.25 | 0 | 01111101 | 000...000 |
| 9 | 0.3125 | 0 | 01111101 | 010...000 |
| 10 | 0.375 | 0 | 01111101 | 100...000 |
| 11 | 0.4375 | 0 | 01111101 | 110...000 |
| 12 | 0.5 | 0 | 01111110 | 000...000 |
| 13 | 0.625 | 0 | 01111110 | 010...000 |
| 14 | 0.75 | 0 | 01111110 | 100...000 |
| 15 | 0.875 | 0 | 01111110 | 110...000 |
| 16 | 1.0 | 0 | 01111111 | 000...000 |
| 17 | 1.25 | 0 | 01111111 | 010...000 |
| 18 | 1.5 | 0 | 01111111 | 100...000 |
| 19 | 1.75 | 0 | 01111111 | 110...000 |
| 20 | 2.0 | 0 | 10000000 | 000...000 |
| 21 | 2.5 | 0 | 10000000 | 010...000 |
| 22 | 3 | 0 | 10000000 | 100...000 |
| 23 | 4 | 0 | 10000001 | 000...000 |
| 24 | 8 | 0 | 10000010 | 000...000 |
| 25 | 16 | 0 | 10000011 | 000...000 |
| 26 | 128 (2⁷) | 0 | 10000110 | 000...000 |
| 27 | 256 (2⁸) | 0 | 10000111 | 000...000 |
| 28 | 2¹⁵ | 0 | 10001110 | 000...000 |
| 29 | 2¹⁶ | 0 | 10001111 | 000...000 |
| 30 | +∞ | 0 | 11111111 | 000...000 |
| 31 | Canonical NaN | 0 | 11111111 | 100...000 |

Immediate values loaded by the FLI.S instruction.


The preferred assembly syntax for entries 1, 30, and 31 is min, inf,
and nan, respectively. For entries 0 through 29 (including entry 1),
the assembler will accept decimal constants in C-like syntax.


The set of 32 constants was chosen by examining floating-point
libraries, including the C standard math library, and to optimize
fixed-point to floating-point conversion.
Entries 8–22 follow a regular encoding pattern. No entry sets mantissa
bits other than the two most significant ones.

If the D extension is implemented, FLI.D performs the analogous
operation, but loads a double-precision value into floating-point
register rd. Note that entry 1 (corresponding to the minimum positive
normal value) has a numerically different value for double-precision
than for single-precision. FLI.D is encoded like FLI.S, but with
fmt=D.
If the Q extension is implemented, FLI.Q performs the analogous
operation, but loads a quad-precision value into floating-point register
rd. Note that entry 1 (corresponding to the minimum positive normal
value) has a numerically different value for quad-precision. FLI.Q is
encoded like FLI.S, but with fmt=Q.
If the Zfh or Zvfh extension is implemented, FLI.H performs the
analogous operation, but loads a half-precision floating-point value
into register rd. Note that entry 1 (corresponding to the minimum
positive normal value) has a numerically different value for
half-precision. Furthermore, since 2¹⁶ is not representable
in half-precision floating-point, entry 29 in the table instead loads
positive infinity—i.e., it is redundant with entry 30. FLI.H is encoded
like FLI.S, but with fmt=H.

Additionally, since 2⁻¹⁶ is a subnormal in half-precision,
entry 1 is numerically greater than entry 2 for FLI.H.

The FLI.fmt instructions never set any floating-point exception flags.
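
One way to cross-check the FLI.S table is to assemble each entry's sign, exponent, and significand fields into a 32-bit pattern and reinterpret it as a single-precision value. The sketch below does that; the names `FLI_S_FIELDS`, `fli_s_bits`, and `fli_s_value` are hypothetical, and only the two most significant significand bits are stored because the table sets no others.

```python
import struct

# (sign, 8-bit exponent, top two significand bits) for rs1 = 0..31,
# transcribed from the FLI.S table; all lower significand bits are zero.
FLI_S_FIELDS = [
    (1, 0b01111111, 0b00),  # 0: -1.0
    (0, 0b00000001, 0b00),  # 1: minimum positive normal
    (0, 0b01101111, 0b00),  # 2: 2**-16
    (0, 0b01110000, 0b00),  # 3: 2**-15
    (0, 0b01110111, 0b00),  # 4: 2**-8
    (0, 0b01111000, 0b00),  # 5: 2**-7
    (0, 0b01111011, 0b00),  # 6: 0.0625
    (0, 0b01111100, 0b00),  # 7: 0.125
    (0, 0b01111101, 0b00),  # 8: 0.25
    (0, 0b01111101, 0b01),  # 9: 0.3125
    (0, 0b01111101, 0b10),  # 10: 0.375
    (0, 0b01111101, 0b11),  # 11: 0.4375
    (0, 0b01111110, 0b00),  # 12: 0.5
    (0, 0b01111110, 0b01),  # 13: 0.625
    (0, 0b01111110, 0b10),  # 14: 0.75
    (0, 0b01111110, 0b11),  # 15: 0.875
    (0, 0b01111111, 0b00),  # 16: 1.0
    (0, 0b01111111, 0b01),  # 17: 1.25
    (0, 0b01111111, 0b10),  # 18: 1.5
    (0, 0b01111111, 0b11),  # 19: 1.75
    (0, 0b10000000, 0b00),  # 20: 2.0
    (0, 0b10000000, 0b01),  # 21: 2.5
    (0, 0b10000000, 0b10),  # 22: 3
    (0, 0b10000001, 0b00),  # 23: 4
    (0, 0b10000010, 0b00),  # 24: 8
    (0, 0b10000011, 0b00),  # 25: 16
    (0, 0b10000110, 0b00),  # 26: 128
    (0, 0b10000111, 0b00),  # 27: 256
    (0, 0b10001110, 0b00),  # 28: 2**15
    (0, 0b10001111, 0b00),  # 29: 2**16
    (0, 0b11111111, 0b00),  # 30: +infinity
    (0, 0b11111111, 0b10),  # 31: canonical NaN
]

def fli_s_bits(rs1: int) -> int:
    """Return the 32-bit pattern loaded by FLI.S for a given rs1 field."""
    sign, exponent, sig_top2 = FLI_S_FIELDS[rs1]
    return (sign << 31) | (exponent << 23) | (sig_top2 << 21)

def fli_s_value(rs1: int) -> float:
    """Reinterpret the bit pattern as an IEEE 754 single-precision value."""
    (value,) = struct.unpack("<f", struct.pack("<I", fli_s_bits(rs1)))
    return value
```

Comparing `fli_s_value(n)` against the Value column is a quick way to catch transcription errors in either direction.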
Minimum and Maximum Instructions

The FMINM.S and FMAXM.S instructions are defined like the FMIN.S and
FMAX.S instructions, except that if either input is NaN, the result is
the canonical NaN.
If the D extension is implemented, FMINM.D and FMAXM.D instructions are
analogously defined to operate on double-precision numbers.
If the Zfh extension is implemented, FMINM.H and FMAXM.H instructions
are analogously defined to operate on half-precision numbers.
If the Q extension is implemented, FMINM.Q and FMAXM.Q instructions are
analogously defined to operate on quad-precision numbers.
These instructions are encoded like their FMIN and FMAX counterparts,
but with instruction bit 13 set to 1.

These instructions implement the IEEE 754-2019 minimum and maximum
operations.
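
The difference in NaN handling can be sketched as follows. This is an illustrative model only: `fmin` and `fminm` are hypothetical function names, exception flags are not modelled, and the FMIN behavior shown (a single NaN input yields the other operand, and −0.0 is treated as less than +0.0) is recalled from the F extension chapter rather than stated in this section.

```python
import math

def fmin(a: float, b: float) -> float:
    """Sketch of FMIN: a single NaN input is ignored."""
    if math.isnan(a) and math.isnan(b):
        return math.nan  # stands in for the canonical NaN
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    if a == b:
        # Covers +0.0 versus -0.0: prefer the negative zero.
        return a if math.copysign(1.0, a) < 0 else b
    return a if a < b else b

def fminm(a: float, b: float) -> float:
    """Sketch of FMINM: any NaN input produces the canonical NaN."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return fmin(a, b)
```

With one NaN operand, `fmin(1.0, math.nan)` returns 1.0 while `fminm(1.0, math.nan)` returns NaN; for non-NaN inputs the two agree.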

Round-to-Integer Instructions

The FROUND.S instruction rounds the single-precision floating-point
number in floating-point register rs1 to an integer, according to the
rounding mode specified in the instruction’s rm field. It then writes
that integer, represented as a single-precision floating-point number,
to floating-point register rd. Zero and infinite inputs are copied to
rd unmodified. Signaling NaN inputs cause the invalid operation
exception flag to be set; no other exception flags are set. FROUND.S is
encoded like FCVT.S.D, but with rs2=4.
The FROUNDNX.S instruction is defined similarly, but it also sets the
inexact exception flag if the input differs from the rounded result and
is not NaN. FROUNDNX.S is encoded like FCVT.S.D, but with rs2=5.
If the D extension is implemented, FROUND.D and FROUNDNX.D instructions
are analogously defined to operate on double-precision numbers. They are
encoded like FCVT.D.S, but with rs2=4 and 5, respectively.
If the Zfh extension is implemented, FROUND.H and FROUNDNX.H
instructions are analogously defined to operate on half-precision
numbers. They are encoded like FCVT.H.S, but with rs2=4 and 5,
respectively.
If the Q extension is implemented, FROUND.Q and FROUNDNX.Q instructions
are analogously defined to operate on quad-precision numbers. They are
encoded like FCVT.Q.S, but with rs2=4 and 5, respectively.

The FROUNDNX.fmt instructions implement the IEEE 754-2019
roundToIntegralExact operation, and the FROUND.fmt instructions
implement the other operations in the roundToIntegral family.

Modular Convert-to-Integer Instruction

The FCVTMOD.W.D instruction is defined similarly to the FCVT.W.D
instruction, with the following differences. FCVTMOD.W.D always rounds
towards zero. Bits 31:0 are taken from the rounded, unbounded two’s
complement result, then sign-extended to XLEN bits and written to
integer register rd. ±∞ and NaN are converted to zero.
Floating-point exception flags are raised the same as they would be for
FCVT.W.D with the same input operand.
This instruction is only provided if the D extension is implemented. It
is encoded like FCVT.W.D, but with the rs2 field set to 8 and the rm
field set to 1 (RTZ). Other rm values are reserved.

The assembly syntax requires the RTZ rounding mode to be explicitly
specified, i.e., fcvtmod.w.d rd, rs1, rtz.


The FCVTMOD.W.D instruction was added principally to accelerate the
processing of JavaScript Numbers. Numbers are double-precision
values, but some operators implicitly truncate them to signed integers
mod 2³².
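
The conversion described above can be modelled in a few lines. This is a sketch, not a reference implementation: `fcvtmod_w_d` is a hypothetical Python function, it returns the sign-extended result as a signed integer, and the floating-point exception flags are not modelled.

```python
import math

def fcvtmod_w_d(value: float) -> int:
    """Model the integer result of FCVTMOD.W.D as a signed value."""
    if math.isnan(value) or math.isinf(value):
        return 0  # infinities and NaN are converted to zero
    truncated = math.trunc(value)   # round towards zero, unbounded integer
    low32 = truncated & 0xFFFFFFFF  # bits 31:0 of the two's complement result
    # Sign-extend from bit 31.
    return low32 - (1 << 32) if low32 & 0x80000000 else low32
```

For instance, an input of 2³² + 5 wraps to 5, and 2³¹ wraps to −2³¹, which matches truncation to a signed integer mod 2³².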

Move Instructions

For RV32 only, if the D extension is implemented, the FMVH.X.D
instruction moves bits 63:32 of floating-point register rs1 into
integer register rd. It is encoded in the OP-FP major opcode with
funct3=0, rs2=1, and funct7=1110001.

FMVH.X.D is used in conjunction with the existing FMV.X.W instruction to
move a double-precision floating-point number to a pair of x-registers.

For RV32 only, if the D extension is implemented, the FMVP.D.X
instruction moves a double-precision number from a pair of integer
registers into a floating-point register. Integer registers rs1 and
rs2 supply bits 31:0 and 63:32, respectively; the result is written to
floating-point register rd. FMVP.D.X is encoded in the OP-FP major
opcode with funct3=0 and funct7=1011001.
For RV64 only, if the Q extension is implemented, the FMVH.X.Q
instruction moves bits 127:64 of floating-point register rs1 into
integer register rd. It is encoded in the OP-FP major opcode with
funct3=0, rs2=1, and funct7=1110011.

FMVH.X.Q is used in conjunction with the existing FMV.X.D instruction to
move a quad-precision floating-point number to a pair of x-registers.

For RV64 only, if the Q extension is implemented, the FMVP.Q.X
instruction moves a quad-precision number from a pair of integer
registers into a floating-point register. Integer registers rs1 and
rs2 supply bits 63:0 and 127:64, respectively; the result is written
to floating-point register rd. FMVP.Q.X is encoded in the OP-FP major
opcode with funct3=0 and funct7=1011011.
Comparison Instructions

The FLEQ.S and FLTQ.S instructions are defined like the FLE.S and FLT.S
instructions, except that quiet NaN inputs do not cause the invalid
operation exception flag to be set.
If the D extension is implemented, FLEQ.D and FLTQ.D instructions are
analogously defined to operate on double-precision numbers.
If the Zfh extension is implemented, FLEQ.H and FLTQ.H instructions are
analogously defined to operate on half-precision numbers.
If the Q extension is implemented, FLEQ.Q and FLTQ.Q instructions are
analogously defined to operate on quad-precision numbers.
These instructions are encoded like their FLE and FLT counterparts, but
with instruction bit 14 set to 1.

We do not expect analogous comparison instructions will be added to the
vector ISA, since they can be reasonably efficiently emulated using
masking.

# “Ztso” Standard Extension for Total Store Ordering, v0.1
This chapter defines the “Ztso” extension for the RISC-V Total Store
Ordering (RVTSO) memory consistency model. RVTSO is defined as a delta
from RVWMO, which is defined in
Chapter [sec:rvwmo].

The Ztso extension is meant to facilitate the porting of code originally
written for architectures with TSO memory models, such as x86 or some
versions of SPARC. It also supports implementations which inherently
provide RVTSO behavior and want to expose that fact to software.

RVTSO makes the following adjustments to RVWMO:


All load operations behave as if they have an acquire-RCpc
annotation.


All store operations behave as if they have a release-RCpc
annotation.


All AMOs behave as if they have both acquire-RCsc and release-RCsc
annotations.


These rules render all PPO rules except
[ppo:fence]–[ppo:rcsc] redundant. They also make
redundant any non-I/O fences that do not have both PW and SR set.
Finally, they also imply that no memory operation will be reordered past
an AMO in either direction.
In the context of RVTSO, as is the case for RVWMO, the storage ordering
annotations are concisely and completely defined by PPO rules
[ppo:acquire]–[ppo:rcsc]. In both of these memory
models, it is the Load Value Axiom that allows a hart to forward a value
from its store buffer to a subsequent (in program order) load; that is,
stores can be forwarded locally before they are visible to other harts.

Additionally, if the Ztso extension is implemented, then vector memory
instructions in the V extension and Zve family of extensions follow
RVTSO at the instruction level. The Ztso extension does not strengthen
the ordering of intra-instruction element accesses.
Although Ztso adds no new instructions to the ISA, code written assuming
RVTSO will not run correctly on implementations that do not support
Ztso. Binaries compiled to run only under Ztso should indicate this via
a flag in the binary, so that platforms which do not implement Ztso can
simply refuse to run them.
RV32/64G Instruction Set Listings

One goal of the RISC-V project is that it be used as a stable software
development target. For this purpose, we define a combination of a base
ISA (RV32I or RV64I) plus selected standard extensions (IMAFD, Zicsr,
Zifencei) as a “general-purpose” ISA, and we use the abbreviation G for
the IMAFDZicsr_Zifencei combination of instruction-set extensions. This
chapter presents opcode maps and instruction-set listings for RV32G and
RV64G.
Table [opcodemap] shows a map of the major
opcodes for RVG. Major opcodes with 3 or more lower bits set are
reserved for instruction lengths greater than 32 bits. Opcodes marked as
reserved should be avoided for custom instruction-set extensions as
they might be used by future standard extensions. Major opcodes marked
as custom-0 and custom-1 will be avoided by future standard
extensions and are recommended for use by custom instruction-set
extensions within the base 32-bit instruction format. The opcodes marked
custom-2/rv128 and custom-3/rv128 are reserved for future use by
RV128, but will otherwise be avoided for standard extensions and so can
also be used for custom instruction-set extensions in RV32 and RV64.
We believe RV32G and RV64G provide simple but complete instruction sets
for a broad range of general-purpose computing. The optional compressed
instruction set described in
Chapter [compressed] can be added (forming
RV32GC and RV64GC) to improve performance, code size, and energy
efficiency, though with some additional hardware complexity.
As we move beyond IMAFDC into further instruction-set extensions, the
added instructions tend to be more domain-specific and only provide
benefits to a restricted class of applications, e.g., for multimedia or
security. Unlike most commercial ISAs, the RISC-V ISA design clearly
separates the base ISA and broadly applicable standard extensions from
these more specialized additions.
Chapter [extensions] has a more extensive
discussion of ways to add extensions to the RISC-V ISA.
Table 1.1 lists the CSRs that have currently
been allocated CSR addresses. The timers, counters, and floating-point
CSRs are the only CSRs defined in this specification.


| Number | Privilege | Name | Description |
|---|---|---|---|
| Floating-Point Control and Status Registers | | | |
| 0x001 | Read/write | fflags | Floating-Point Accrued Exceptions. |
| 0x002 | Read/write | frm | Floating-Point Dynamic Rounding Mode. |
| 0x003 | Read/write | fcsr | Floating-Point Control and Status Register (frm + fflags). |
| Counters and Timers | | | |
| 0xC00 | Read-only | cycle | Cycle counter for RDCYCLE instruction. |
| 0xC01 | Read-only | time | Timer for RDTIME instruction. |
| 0xC02 | Read-only | instret | Instructions-retired counter for RDINSTRET instruction. |
| 0xC80 | Read-only | cycleh | Upper 32 bits of cycle, RV32I only. |
| 0xC81 | Read-only | timeh | Upper 32 bits of time, RV32I only. |
| 0xC82 | Read-only | instreth | Upper 32 bits of instret, RV32I only. |

RISC-V control and status register (CSR) address map.
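
The relationships in the CSR table (fcsr combining frm and fflags, and the `h` CSRs holding the upper halves of the RV32I counters) can be sketched as below. The helper names are hypothetical, and the bit layout used for fcsr (fflags in bits 4:0, frm in bits 7:5) is recalled from the F extension chapter, not given in this table.

```python
# CSR addresses transcribed from the address map above.
CSR_ADDRESS = {
    "fflags": 0x001, "frm": 0x002, "fcsr": 0x003,
    "cycle": 0xC00, "time": 0xC01, "instret": 0xC02,
    "cycleh": 0xC80, "timeh": 0xC81, "instreth": 0xC82,
}

def pack_fcsr(frm: int, fflags: int) -> int:
    """Combine frm and fflags (assumed layout: frm in 7:5, fflags in 4:0)."""
    return ((frm & 0x7) << 5) | (fflags & 0x1F)

def unpack_fcsr(fcsr: int):
    """Split an fcsr value back into (frm, fflags)."""
    return (fcsr >> 5) & 0x7, fcsr & 0x1F

def combine_counter(high: int, low: int) -> int:
    """Form a 64-bit counter value from e.g. cycleh and cycle on RV32I."""
    return ((high & 0xFFFFFFFF) << 32) | (low & 0xFFFFFFFF)
```

Note that in the table each `h` CSR sits at the base counter's address plus 0x80.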


| inst[6:5] \ inst[4:2] | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 (> 32b) |
|---|---|---|---|---|---|---|---|---|
| 00 | LOAD | LOAD-FP | custom-0 | MISC-MEM | OP-IMM | AUIPC | OP-IMM-32 | 48b |
| 01 | STORE | STORE-FP | custom-1 | AMO | OP | LUI | OP-32 | 64b |
| 10 | MADD | MSUB | NMSUB | NMADD | OP-FP | OP-V | custom-2/rv128 | 48b |
| 11 | BRANCH | JALR | reserved | JAL | SYSTEM | reserved | custom-3/rv128 | ≥ 80b |
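
The opcode map can be used as a lookup table indexed by inst[6:5] (row) and inst[4:2] (column). The following sketch shows that lookup; `MAJOR_OPCODES` and `major_opcode` are hypothetical names, and the check that inst[1:0] equals 11 is an assumption taken from the base encoding prefix described elsewhere in the manual.

```python
# Rows are inst[6:5] = 00, 01, 10, 11; columns are inst[4:2] = 000 .. 111.
MAJOR_OPCODES = [
    ["LOAD", "LOAD-FP", "custom-0", "MISC-MEM",
     "OP-IMM", "AUIPC", "OP-IMM-32", "48b"],
    ["STORE", "STORE-FP", "custom-1", "AMO",
     "OP", "LUI", "OP-32", "64b"],
    ["MADD", "MSUB", "NMSUB", "NMADD",
     "OP-FP", "OP-V", "custom-2/rv128", "48b"],
    ["BRANCH", "JALR", "reserved", "JAL",
     "SYSTEM", "reserved", "custom-3/rv128", ">=80b"],
]

def major_opcode(inst: int) -> str:
    """Look up the major opcode name for an instruction word."""
    if inst & 0b11 != 0b11:
        raise ValueError("inst[1:0] != 11: not covered by this opcode map")
    row = (inst >> 5) & 0b11     # inst[6:5]
    column = (inst >> 2) & 0b111  # inst[4:2]
    return MAJOR_OPCODES[row][column]
```

For example, an instruction word whose low seven bits are 0110011 falls in row 01, column 100, and so decodes to OP.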


| Fields (bit 31 down to bit 0) | Format |
|---|---|
| funct7 rs2 rs1 funct3 rd opcode | R-type |
| imm[11:0] rs1 funct3 rd opcode | I-type |
| imm[11:5] rs2 rs1 funct3 imm[4:0] opcode | S-type |
| imm[12\|10:5] rs2 rs1 funct3 imm[4:1\|11] opcode | B-type |
| imm[31:12] rd opcode | U-type |
| imm[20\|10:1\|11\|19:12] rd opcode | J-type |

RV32I Base Instruction Set

| Fields (bit 31 down to bit 0) | Instruction |
|---|---|
| imm[31:12] rd 0110111 | LUI |
| imm[31:12] rd 0010111 | AUIPC |
| imm[20\|10:1\|11\|19:12] rd 1101111 | JAL |
| imm[11:0] rs1 000 rd 1100111 | JALR |
| imm[12\|10:5] rs2 rs1 000 imm[4:1\|11] 1100011 | BEQ |
| imm[12\|10:5] rs2 rs1 001 imm[4:1\|11] 1100011 | BNE |
| imm[12\|10:5] rs2 rs1 100 imm[4:1\|11] 1100011 | BLT |
| imm[12\|10:5] rs2 rs1 101 imm[4:1\|11] 1100011 | BGE |
| imm[12\|10:5] rs2 rs1 110 imm[4:1\|11] 1100011 | BLTU |
| imm[12\|10:5] rs2 rs1 111 imm[4:1\|11] 1100011 | BGEU |
| imm[11:0] rs1 000 rd 0000011 | LB |
| imm[11:0] rs1 001 rd 0000011 | LH |
| imm[11:0] rs1 010 rd 0000011 | LW |
| imm[11:0] rs1 100 rd 0000011 | LBU |
| imm[11:0] rs1 101 rd 0000011 | LHU |
| imm[11:5] rs2 rs1 000 imm[4:0] 0100011 | SB |
| imm[11:5] rs2 rs1 001 imm[4:0] 0100011 | SH |
| imm[11:5] rs2 rs1 010 imm[4:0] 0100011 | SW |
| imm[11:0] rs1 000 rd 0010011 | ADDI |
| imm[11:0] rs1 010 rd 0010011 | SLTI |
| imm[11:0] rs1 011 rd 0010011 | SLTIU |
| imm[11:0] rs1 100 rd 0010011 | XORI |
| imm[11:0] rs1 110 rd 0010011 | ORI |
| imm[11:0] rs1 111 rd 0010011 | ANDI |
| 0000000 shamt rs1 001 rd 0010011 | SLLI |
| 0000000 shamt rs1 101 rd 0010011 | SRLI |
| 0100000 shamt rs1 101 rd 0010011 | SRAI |
| 0000000 rs2 rs1 000 rd 0110011 | ADD |
| 0100000 rs2 rs1 000 rd 0110011 | SUB |
| 0000000 rs2 rs1 001 rd 0110011 | SLL |
| 0000000 rs2 rs1 010 rd 0110011 | SLT |
| 0000000 rs2 rs1 011 rd 0110011 | SLTU |
| 0000000 rs2 rs1 100 rd 0110011 | XOR |
| 0000000 rs2 rs1 101 rd 0110011 | SRL |
| 0100000 rs2 rs1 101 rd 0110011 | SRA |
| 0000000 rs2 rs1 110 rd 0110011 | OR |
| 0000000 rs2 rs1 111 rd 0110011 | AND |
| fm pred succ rs1 000 rd 0001111 | FENCE |
| 1000 0011 0011 00000 000 00000 0001111 | FENCE.TSO |
| 0000 0001 0000 00000 000 00000 0001111 | PAUSE |
| 000000000000 00000 000 00000 1110011 | ECALL |
| 000000000001 00000 000 00000 1110011 | EBREAK |
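
As an illustration of how the format rows map onto instruction words, the sketch below packs R-type, I-type, and S-type fields into 32-bit encodings. The function names are hypothetical, and the bit positions (funct7 in 31:25, rs2 in 24:20, rs1 in 19:15, funct3 in 14:12, rd in 11:7, opcode in 6:0) are assumed from the base ISA chapter; the listing itself only gives the field order.

```python
def encode_r(funct7: int, rs2: int, rs1: int, funct3: int,
             rd: int, opcode: int) -> int:
    """Pack an R-type instruction word."""
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)

def encode_i(imm: int, rs1: int, funct3: int, rd: int, opcode: int) -> int:
    """Pack an I-type instruction word; imm is truncated to 12 bits."""
    return (((imm & 0xFFF) << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)

def encode_s(imm: int, rs2: int, rs1: int, funct3: int, opcode: int) -> int:
    """Pack an S-type instruction word; imm is split into 11:5 and 4:0."""
    imm &= 0xFFF
    return (((imm >> 5) << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | ((imm & 0x1F) << 7) | opcode)

# add x3, x1, x2 using the ADD row (funct7=0000000, funct3=000).
ADD_X3_X1_X2 = encode_r(0b0000000, 2, 1, 0b000, 3, 0b0110011)
# addi x1, x0, 5 using the ADDI row (funct3=000).
ADDI_X1_X0_5 = encode_i(5, 0, 0b000, 1, 0b0010011)
# sw x2, 8(x1) using the SW row (funct3=010).
SW_X2_8_X1 = encode_s(8, 2, 1, 0b010, 0b0100011)
```

The same helpers cover most rows of the listings, since the extension tables reuse these formats with different constant fields.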


| Fields (bit 31 down to bit 0) | Format |
|---|---|
| funct7 rs2 rs1 funct3 rd opcode | R-type |
| imm[11:0] rs1 funct3 rd opcode | I-type |
| imm[11:5] rs2 rs1 funct3 imm[4:0] opcode | S-type |

RV64I Base Instruction Set (in addition to RV32I)

| Fields (bit 31 down to bit 0) | Instruction |
|---|---|
| imm[11:0] rs1 110 rd 0000011 | LWU |
| imm[11:0] rs1 011 rd 0000011 | LD |
| imm[11:5] rs2 rs1 011 imm[4:0] 0100011 | SD |
| 000000 shamt rs1 001 rd 0010011 | SLLI |
| 000000 shamt rs1 101 rd 0010011 | SRLI |
| 010000 shamt rs1 101 rd 0010011 | SRAI |
| imm[11:0] rs1 000 rd 0011011 | ADDIW |
| 0000000 shamt rs1 001 rd 0011011 | SLLIW |
| 0000000 shamt rs1 101 rd 0011011 | SRLIW |
| 0100000 shamt rs1 101 rd 0011011 | SRAIW |
| 0000000 rs2 rs1 000 rd 0111011 | ADDW |
| 0100000 rs2 rs1 000 rd 0111011 | SUBW |
| 0000000 rs2 rs1 001 rd 0111011 | SLLW |
| 0000000 rs2 rs1 101 rd 0111011 | SRLW |
| 0100000 rs2 rs1 101 rd 0111011 | SRAW |


RV32/RV64 Zifencei Standard Extension

| Fields (bit 31 down to bit 0) | Instruction |
|---|---|
| imm[11:0] rs1 001 rd 0001111 | FENCE.I |

RV32/RV64 Zicsr Standard Extension

| Fields (bit 31 down to bit 0) | Instruction |
|---|---|
| csr rs1 001 rd 1110011 | CSRRW |
| csr rs1 010 rd 1110011 | CSRRS |
| csr rs1 011 rd 1110011 | CSRRC |
| csr uimm 101 rd 1110011 | CSRRWI |
| csr uimm 110 rd 1110011 | CSRRSI |
| csr uimm 111 rd 1110011 | CSRRCI |

RV32M Standard Extension

| Fields (bit 31 down to bit 0) | Instruction |
|---|---|
| 0000001 rs2 rs1 000 rd 0110011 | MUL |
| 0000001 rs2 rs1 001 rd 0110011 | MULH |
| 0000001 rs2 rs1 010 rd 0110011 | MULHSU |
| 0000001 rs2 rs1 011 rd 0110011 | MULHU |
| 0000001 rs2 rs1 100 rd 0110011 | DIV |
| 0000001 rs2 rs1 101 rd 0110011 | DIVU |
| 0000001 rs2 rs1 110 rd 0110011 | REM |
| 0000001 rs2 rs1 111 rd 0110011 | REMU |

RV64M Standard Extension (in addition to RV32M)

| Fields (bit 31 down to bit 0) | Instruction |
|---|---|
| 0000001 rs2 rs1 000 rd 0111011 | MULW |
| 0000001 rs2 rs1 100 rd 0111011 | DIVW |
| 0000001 rs2 rs1 101 rd 0111011 | DIVUW |
| 0000001 rs2 rs1 110 rd 0111011 | REMW |
| 0000001 rs2 rs1 111 rd 0111011 | REMUW |


| Fields (bit 31 down to bit 0) | Format |
|---|---|
| funct7 rs2 rs1 funct3 rd opcode | R-type |

RV32A Standard Extension

| Fields (bit 31 down to bit 0) | Instruction |
|---|---|
| 00010 aq rl 00000 rs1 010 rd 0101111 | LR.W |
| 00011 aq rl rs2 rs1 010 rd 0101111 | SC.W |
| 00001 aq rl rs2 rs1 010 rd 0101111 | AMOSWAP.W |
| 00000 aq rl rs2 rs1 010 rd 0101111 | AMOADD.W |
| 00100 aq rl rs2 rs1 010 rd 0101111 | AMOXOR.W |
| 01100 aq rl rs2 rs1 010 rd 0101111 | AMOAND.W |
| 01000 aq rl rs2 rs1 010 rd 0101111 | AMOOR.W |
| 10000 aq rl rs2 rs1 010 rd 0101111 | AMOMIN.W |
| 10100 aq rl rs2 rs1 010 rd 0101111 | AMOMAX.W |
| 11000 aq rl rs2 rs1 010 rd 0101111 | AMOMINU.W |
| 11100 aq rl rs2 rs1 010 rd 0101111 | AMOMAXU.W |

RV64A Standard Extension (in addition to RV32A)

| Fields (bit 31 down to bit 0) | Instruction |
|---|---|
| 00010 aq rl 00000 rs1 011 rd 0101111 | LR.D |
| 00011 aq rl rs2 rs1 011 rd 0101111 | SC.D |
| 00001 aq rl rs2 rs1 011 rd 0101111 | AMOSWAP.D |
| 00000 aq rl rs2 rs1 011 rd 0101111 | AMOADD.D |
| 00100 aq rl rs2 rs1 011 rd 0101111 | AMOXOR.D |
| 01100 aq rl rs2 rs1 011 rd 0101111 | AMOAND.D |
| 01000 aq rl rs2 rs1 011 rd 0101111 | AMOOR.D |
| 10000 aq rl rs2 rs1 011 rd 0101111 | AMOMIN.D |
| 10100 aq rl rs2 rs1 011 rd 0101111 | AMOMAX.D |
| 11000 aq rl rs2 rs1 011 rd 0101111 | AMOMINU.D |
| 11100 aq rl rs2 rs1 011 rd 0101111 | AMOMAXU.D |


| Fields (bit 31 down to bit 0) | Format |
|---|---|
| funct7 rs2 rs1 funct3 rd opcode | R-type |
| rs3 funct2 rs2 rs1 funct3 rd opcode | R4-type |
| imm[11:0] rs1 funct3 rd opcode | I-type |
| imm[11:5] rs2 rs1 funct3 imm[4:0] opcode | S-type |

RV32F Standard Extension

| Fields (bit 31 down to bit 0) | Instruction |
|---|---|
| imm[11:0] rs1 010 rd 0000111 | FLW |
| imm[11:5] rs2 rs1 010 imm[4:0] 0100111 | FSW |
| rs3 00 rs2 rs1 rm rd 1000011 | FMADD.S |
| rs3 00 rs2 rs1 rm rd 1000111 | FMSUB.S |
| rs3 00 rs2 rs1 rm rd 1001011 | FNMSUB.S |
| rs3 00 rs2 rs1 rm rd 1001111 | FNMADD.S |
| 0000000 rs2 rs1 rm rd 1010011 | FADD.S |
| 0000100 rs2 rs1 rm rd 1010011 | FSUB.S |
| 0001000 rs2 rs1 rm rd 1010011 | FMUL.S |
| 0001100 rs2 rs1 rm rd 1010011 | FDIV.S |
| 0101100 00000 rs1 rm rd 1010011 | FSQRT.S |
| 0010000 rs2 rs1 000 rd 1010011 | FSGNJ.S |
| 0010000 rs2 rs1 001 rd 1010011 | FSGNJN.S |
| 0010000 rs2 rs1 010 rd 1010011 | FSGNJX.S |
| 0010100 rs2 rs1 000 rd 1010011 | FMIN.S |
| 0010100 rs2 rs1 001 rd 1010011 | FMAX.S |
| 1100000 00000 rs1 rm rd 1010011 | FCVT.W.S |
| 1100000 00001 rs1 rm rd 1010011 | FCVT.WU.S |
| 1110000 00000 rs1 000 rd 1010011 | FMV.X.W |
| 1010000 rs2 rs1 010 rd 1010011 | FEQ.S |
| 1010000 rs2 rs1 001 rd 1010011 | FLT.S |
| 1010000 rs2 rs1 000 rd 1010011 | FLE.S |
| 1110000 00000 rs1 001 rd 1010011 | FCLASS.S |
| 1101000 00000 rs1 rm rd 1010011 | FCVT.S.W |
| 1101000 00001 rs1 rm rd 1010011 | FCVT.S.WU |
| 1111000 00000 rs1 000 rd 1010011 | FMV.W.X |

RV64F Standard Extension (in addition to RV32F)

| Fields (bit 31 down to bit 0) | Instruction |
|---|---|
| 1100000 00010 rs1 rm rd 1010011 | FCVT.L.S |
| 1100000 00011 rs1 rm rd 1010011 | FCVT.LU.S |
| 1101000 00010 rs1 rm rd 1010011 | FCVT.S.L |
| 1101000 00011 rs1 rm rd 1010011 | FCVT.S.LU |


| Fields (bit 31 down to bit 0) | Format |
|---|---|
| funct7 rs2 rs1 funct3 rd opcode | R-type |
| rs3 funct2 rs2 rs1 funct3 rd opcode | R4-type |
| imm[11:0] rs1 funct3 rd opcode | I-type |
| imm[11:5] rs2 rs1 funct3 imm[4:0] opcode | S-type |

RV32D Standard Extension

| Fields (bit 31 down to bit 0) | Instruction |
|---|---|
| imm[11:0] rs1 011 rd 0000111 | FLD |
| imm[11:5] rs2 rs1 011 imm[4:0] 0100111 | FSD |
| rs3 01 rs2 rs1 rm rd 1000011 | FMADD.D |
| rs3 01 rs2 rs1 rm rd 1000111 | FMSUB.D |
| rs3 01 rs2 rs1 rm rd 1001011 | FNMSUB.D |
| rs3 01 rs2 rs1 rm rd 1001111 | FNMADD.D |
| 0000001 rs2 rs1 rm rd 1010011 | FADD.D |
| 0000101 rs2 rs1 rm rd 1010011 | FSUB.D |
| 0001001 rs2 rs1 rm rd 1010011 | FMUL.D |
| 0001101 rs2 rs1 rm rd 1010011 | FDIV.D |
| 0101101 00000 rs1 rm rd 1010011 | FSQRT.D |
| 0010001 rs2 rs1 000 rd 1010011 | FSGNJ.D |
| 0010001 rs2 rs1 001 rd 1010011 | FSGNJN.D |
| 0010001 rs2 rs1 010 rd 1010011 | FSGNJX.D |
| 0010101 rs2 rs1 000 rd 1010011 | FMIN.D |
| 0010101 rs2 rs1 001 rd 1010011 | FMAX.D |
| 0100000 00001 rs1 rm rd 1010011 | FCVT.S.D |
| 0100001 00000 rs1 rm rd 1010011 | FCVT.D.S |
| 1010001 rs2 rs1 010 rd 1010011 | FEQ.D |
| 1010001 rs2 rs1 001 rd 1010011 | FLT.D |
| 1010001 rs2 rs1 000 rd 1010011 | FLE.D |
| 1110001 00000 rs1 001 rd 1010011 | FCLASS.D |
| 1100001 00000 rs1 rm rd 1010011 | FCVT.W.D |
| 1100001 00001 rs1 rm rd 1010011 | FCVT.WU.D |
| 1101001 00000 rs1 rm rd 1010011 | FCVT.D.W |
| 1101001 00001 rs1 rm rd 1010011 | FCVT.D.WU |

RV64D Standard Extension (in addition to RV32D)

| Fields (bit 31 down to bit 0) | Instruction |
|---|---|
| 1100001 00010 rs1 rm rd 1010011 | FCVT.L.D |
| 1100001 00011 rs1 rm rd 1010011 | FCVT.LU.D |
| 1110001 00000 rs1 000 rd 1010011 | FMV.X.D |
| 1101001 00010 rs1 rm rd 1010011 | FCVT.D.L |
| 1101001 00011 rs1 rm rd 1010011 | FCVT.D.LU |
| 1111001 00000 rs1 000 rd 1010011 | FMV.D.X |


| Fields (bit 31 down to bit 0) | Format |
|---|---|
| funct7 rs2 rs1 funct3 rd opcode | R-type |
| rs3 funct2 rs2 rs1 funct3 rd opcode | R4-type |
| imm[11:0] rs1 funct3 rd opcode | I-type |
| imm[11:5] rs2 rs1 funct3 imm[4:0] opcode | S-type |

RV32Q Standard Extension

| Fields (bit 31 down to bit 0) | Instruction |
|---|---|
| imm[11:0] rs1 100 rd 0000111 | FLQ |
| imm[11:5] rs2 rs1 100 imm[4:0] 0100111 | FSQ |
| rs3 11 rs2 rs1 rm rd 1000011 | FMADD.Q |
| rs3 11 rs2 rs1 rm rd 1000111 | FMSUB.Q |
| rs3 11 rs2 rs1 rm rd 1001011 | FNMSUB.Q |
| rs3 11 rs2 rs1 rm rd 1001111 | FNMADD.Q |
| 0000011 rs2 rs1 rm rd 1010011 | FADD.Q |
| 0000111 rs2 rs1 rm rd 1010011 | FSUB.Q |
| 0001011 rs2 rs1 rm rd 1010011 | FMUL.Q |
| 0001111 rs2 rs1 rm rd 1010011 | FDIV.Q |
| 0101111 00000 rs1 rm rd 1010011 | FSQRT.Q |
| 0010011 rs2 rs1 000 rd 1010011 | FSGNJ.Q |
| 0010011 rs2 rs1 001 rd 1010011 | FSGNJN.Q |
| 0010011 rs2 rs1 010 rd 1010011 | FSGNJX.Q |
| 0010111 rs2 rs1 000 rd 1010011 | FMIN.Q |
| 0010111 rs2 rs1 001 rd 1010011 | FMAX.Q |
| 0100000 00011 rs1 rm rd 1010011 | FCVT.S.Q |
| 0100011 00000 rs1 rm rd 1010011 | FCVT.Q.S |
| 0100001 00011 rs1 rm rd 1010011 | FCVT.D.Q |
| 0100011 00001 rs1 rm rd 1010011 | FCVT.Q.D |
| 1010011 rs2 rs1 010 rd 1010011 | FEQ.Q |
| 1010011 rs2 rs1 001 rd 1010011 | FLT.Q |
| 1010011 rs2 rs1 000 rd 1010011 | FLE.Q |
| 1110011 00000 rs1 001 rd 1010011 | FCLASS.Q |
| 1100011 00000 rs1 rm rd 1010011 | FCVT.W.Q |
| 1100011 00001 rs1 rm rd 1010011 | FCVT.WU.Q |
| 1101011 00000 rs1 rm rd 1010011 | FCVT.Q.W |
| 1101011 00001 rs1 rm rd 1010011 | FCVT.Q.WU |

RV64Q Standard Extension (in addition to RV32Q)

| Fields (bit 31 down to bit 0) | Instruction |
|---|---|
| 1100011 00010 rs1 rm rd 1010011 | FCVT.L.Q |
| 1100011 00011 rs1 rm rd 1010011 | FCVT.LU.Q |
| 1101011 00010 rs1 rm rd 1010011 | FCVT.Q.L |
| 1101011 00011 rs1 rm rd 1010011 | FCVT.Q.LU |


| Fields (bit 31 down to bit 0) | Format |
|---|---|
| funct7 rs2 rs1 funct3 rd opcode | R-type |
| rs3 funct2 rs2 rs1 funct3 rd opcode | R4-type |
| imm[11:0] rs1 funct3 rd opcode | I-type |
| imm[11:5] rs2 rs1 funct3 imm[4:0] opcode | S-type |

RV32Zfh Standard Extension

| Fields (bit 31 down to bit 0) | Instruction |
|---|---|
| imm[11:0] rs1 001 rd 0000111 | FLH |
| imm[11:5] rs2 rs1 001 imm[4:0] 0100111 | FSH |
| rs3 10 rs2 rs1 rm rd 1000011 | FMADD.H |
| rs3 10 rs2 rs1 rm rd 1000111 | FMSUB.H |
| rs3 10 rs2 rs1 rm rd 1001011 | FNMSUB.H |
| rs3 10 rs2 rs1 rm rd 1001111 | FNMADD.H |
| 0000010 rs2 rs1 rm rd 1010011 | FADD.H |
| 0000110 rs2 rs1 rm rd 1010011 | FSUB.H |
| 0001010 rs2 rs1 rm rd 1010011 | FMUL.H |
| 0001110 rs2 rs1 rm rd 1010011 | FDIV.H |
| 0101110 00000 rs1 rm rd 1010011 | FSQRT.H |
| 0010010 rs2 rs1 000 rd 1010011 | FSGNJ.H |
| 0010010 rs2 rs1 001 rd 1010011 | FSGNJN.H |
| 0010010 rs2 rs1 010 rd 1010011 | FSGNJX.H |
| 0010110 rs2 rs1 000 rd 1010011 | FMIN.H |
| 0010110 rs2 rs1 001 rd 1010011 | FMAX.H |
| 0100000 00010 rs1 rm rd 1010011 | FCVT.S.H |
| 0100010 00000 rs1 rm rd 1010011 | FCVT.H.S |
| 0100001 00010 rs1 rm rd 1010011 | FCVT.D.H |
| 0100010 00001 rs1 rm rd 1010011 | FCVT.H.D |
| 0100011 00010 rs1 rm rd 1010011 | FCVT.Q.H |
| 0100010 00011 rs1 rm rd 1010011 | FCVT.H.Q |
| 1010010 rs2 rs1 010 rd 1010011 | FEQ.H |
| 1010010 rs2 rs1 001 rd 1010011 | FLT.H |
| 1010010 rs2 rs1 000 rd 1010011 | FLE.H |
| 1110010 00000 rs1 001 rd 1010011 | FCLASS.H |
| 1100010 00000 rs1 rm rd 1010011 | FCVT.W.H |
| 1100010 00001 rs1 rm rd 1010011 | FCVT.WU.H |
| 1110010 00000 rs1 000 rd 1010011 | FMV.X.H |
| 1101010 00000 rs1 rm rd 1010011 | FCVT.H.W |
| 1101010 00001 rs1 rm rd 1010011 | FCVT.H.WU |
| 1111010 00000 rs1 000 rd 1010011 | FMV.H.X |

RV64Zfh Standard Extension (in addition to RV32Zfh)

| Fields (bit 31 down to bit 0) | Instruction |
|---|---|
| 1100010 00010 rs1 rm rd 1010011 | FCVT.L.H |
| 1100010 00011 rs1 rm rd 1010011 | FCVT.LU.H |
| 1101010 00010 rs1 rm rd 1010011 | FCVT.H.L |
| 1101010 00011 rs1 rm rd 1010011 | FCVT.H.LU |


Instruction listing for RISC-V


# Extending RISC-V
In addition to supporting standard general-purpose software development,
another goal of RISC-V is to provide a basis for more specialized
instruction-set extensions or more customized accelerators. The
instruction encoding spaces and optional variable-length instruction
encoding are designed to make it easier to leverage software development
effort for the standard ISA toolchain when building more customized
processors. For example, the intent is to continue to provide full
software support for implementations that only use the standard I base,
perhaps together with many non-standard instruction-set extensions.
This chapter describes various ways in which the base RISC-V ISA can be
extended, together with the scheme for managing instruction-set
extensions developed by independent groups. This volume only deals with
the unprivileged ISA, although the same approach and terminology are used
for supervisor-level extensions described in the second volume.
Extension Terminology

This section defines some standard terminology for describing RISC-V
extensions.
Standard versus Non-Standard Extension

Any RISC-V processor implementation must support a base integer ISA
(RV32I, RV32E, RV64I, or RV128I). In addition, an implementation may
support one or more extensions. We divide extensions into two broad
categories: standard versus non-standard.


A standard extension is one that is generally useful and that is
designed to not conflict with any other standard extension.
Currently, “MAFDQLCBTPV”, described in other chapters of this
manual, are either complete or planned standard extensions.


A non-standard extension may be highly specialized and may conflict
with other standard or non-standard extensions. We anticipate a wide
variety of non-standard extensions will be developed over time, with
some eventually being promoted to standard extensions.


Instruction Encoding Spaces and Prefixes

An instruction encoding space is some number of instruction bits within
which a base ISA or ISA extension is encoded. RISC-V supports varying
instruction lengths, but even within a single instruction length, there
are various sizes of encoding space available. For example, the base
ISAs are defined within a 30-bit encoding space (bits 31–2 of the 32-bit
instruction), while the atomic extension “A” fits within a 25-bit
encoding space (bits 31–7).
We use the term prefix to refer to the bits to the right of an
instruction encoding space (since instruction fetch in RISC-V is
little-endian, the bits to the right are stored at earlier memory
addresses, hence form a prefix in instruction-fetch order). The prefix
for the standard base ISA encoding is the two-bit “11” field held in
bits 1–0 of the 32-bit word, while the prefix for the standard atomic
extension “A” is the seven-bit “0101111” field held in bits 6–0 of the
32-bit word representing the AMO major opcode. A quirk of the encoding
format is that the 3-bit funct3 field used to encode a minor opcode is
not contiguous with the major opcode bits in the 32-bit instruction
format, but is considered part of the prefix for 22-bit instruction
spaces.
Although an instruction encoding space could be of any size, adopting a
smaller set of common sizes simplifies packing independently developed
extensions into a single global encoding.
Table 1.1 gives the suggested sizes for
RISC-V.


| Size | Usage | Available in 16-bit | Available in 32-bit | Available in 48-bit | Available in 64-bit |
|---|---|---|---|---|---|
| 14-bit | Quadrant of compressed 16-bit encoding | 3 | | | |
| 22-bit | Minor opcode in base 32-bit encoding | | 2⁸ | 2²⁰ | 2³⁵ |
| 25-bit | Major opcode in base 32-bit encoding | | 32 | 2¹⁷ | 2³² |
| 30-bit | Quadrant of base 32-bit encoding | | 1 | 2¹² | 2²⁷ |
| 32-bit | Minor opcode in 48-bit encoding | | | 2¹⁰ | 2²⁵ |
| 37-bit | Major opcode in 48-bit encoding | | | 32 | 2²⁰ |
| 40-bit | Quadrant of 48-bit encoding | | | 4 | 2¹⁷ |
| 45-bit | Sub-minor opcode in 64-bit encoding | | | | 2¹² |
| 48-bit | Minor opcode in 64-bit encoding | | | | 2⁹ |
| 52-bit | Major opcode in 64-bit encoding | | | | 32 |

Suggested standard RISC-V instruction encoding space sizes. The last four columns give the number of spaces of each size available in each standard instruction length.
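
The counts in the 32-bit, 48-bit, and 64-bit columns follow from simple arithmetic: if an instruction length leaves U usable bits, it holds 2^(U − S) disjoint spaces of S bits each. The sketch below reproduces them; `USABLE_BITS` and `spaces_available` are hypothetical names, and the usable-bit figures (30, 42, and 57, that is, the instruction length minus its 2-, 6-, or 7-bit length-encoding prefix) are an assumption recalled from the instruction-length encoding chapter rather than stated here.

```python
# Assumed usable bits per instruction length after the length-encoding prefix.
USABLE_BITS = {32: 30, 48: 42, 64: 57}

def spaces_available(space_bits: int, instruction_length: int) -> int:
    """Count the encoding spaces of a given size in an instruction length."""
    usable = USABLE_BITS[instruction_length]
    if space_bits > usable:
        return 0  # the space does not fit; the table leaves this cell blank
    return 2 ** (usable - space_bits)
```

The 16-bit column is a special case: of the four 14-bit quadrants, three are available, because the remaining quadrant is the prefix for longer instructions.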


Greenfield versus Brownfield Extensions

We use the term greenfield extension to describe an extension that
begins populating a new instruction encoding space, and hence can only
cause encoding conflicts at the prefix level. We use the term
brownfield extension to describe an extension that fits around
existing encodings in a previously defined instruction space. A
brownfield extension is necessarily tied to a particular greenfield
parent encoding, and there may be multiple brownfield extensions to the
same greenfield parent encoding. For example, the base ISAs are
greenfield encodings of a 30-bit instruction space, while the FDQ
floating-point extensions are all brownfield extensions adding to the
parent base ISA 30-bit encoding space.
Note that we consider the standard A extension to have a greenfield
encoding as it defines a new previously empty 25-bit encoding space in
the leftmost bits of the full 32-bit base instruction encoding, even
though its standard prefix locates it within the 30-bit encoding space
of its parent base ISA. Changing only its single 7-bit prefix could move
the A extension to a different 30-bit encoding space while only worrying
about conflicts at the prefix level, not within the encoding space
itself.


| | Adds state | No new state |
|---|---|---|
| Greenfield | RV32I(30), RV64I(30) | A(25) |
| Brownfield | F(I), D(F), Q(D) | M(I) |

Two-dimensional characterization of standard instruction-set extensions.


Table 1.2 shows the bases and standard extensions
placed in a simple two-dimensional taxonomy. One axis is whether the
extension is greenfield or brownfield, while the other axis is whether
the extension adds architectural state. For greenfield extensions, the
size of the instruction encoding space is given in parentheses. For
brownfield extensions, the name of the extension (greenfield or
brownfield) it builds upon is given in parentheses. Additional
user-level architectural state usually implies changes to the
supervisor-level system or possibly to the standard calling convention.
Note that RV64I is not considered an extension of RV32I, but a different
complete base encoding.
Standard-Compatible Global Encodings

A complete or global encoding of an ISA for an actual RISC-V
implementation must allocate a unique non-conflicting prefix for every
included instruction encoding space. The bases and every standard
extension have each had a standard prefix allocated to ensure they can
all coexist in a global encoding.
A standard-compatible global encoding is one where the base and every
included standard extension have their standard prefixes. A
standard-compatible global encoding can include non-standard extensions
that do not conflict with the included standard extensions. A
standard-compatible global encoding can also use standard prefixes for
non-standard extensions if the associated standard extensions are not
included in the global encoding. In other words, a standard extension
must use its standard prefix if included in a standard-compatible global
encoding, but otherwise its prefix is free to be reallocated. These
constraints allow a common toolchain to target the standard subset of
any RISC-V standard-compatible global encoding.
Guaranteed Non-Standard Encoding Space

To support development of proprietary custom extensions, portions of the
encoding space are guaranteed to never be used by standard extensions.
RISC-V Extension Design Philosophy

We intend to support a large number of independently developed
extensions by encouraging extension developers to operate within
instruction encoding spaces, and by providing tools to pack these into a
standard-compatible global encoding by allocating unique prefixes. Some
extensions are more naturally implemented as brownfield augmentations of
existing extensions, and will share whatever prefix is allocated to
their parent greenfield extension. The standard extension prefixes avoid
spurious incompatibilities in the encoding of core functionality, while
allowing custom packing of more esoteric extensions.
This capability of repacking RISC-V extensions into different
standard-compatible global encodings can be used in a number of ways.
One use-case is developing highly specialized custom accelerators,
designed to run kernels from important application domains. These might
want to drop all but the base integer ISA and add in only the extensions
that are required for the task in hand. The base ISAs have been designed
to place minimal requirements on a hardware implementation, and have been
encoded to use only a small fraction of a 32-bit instruction encoding
space.
Another use-case is to build a research prototype for a new type of
instruction-set extension. The researchers might not want to expend the
effort to implement a variable-length instruction-fetch unit, and so
would like to prototype their extension using a simple 32-bit
fixed-width instruction encoding. However, this new extension might be
too large to coexist with standard extensions in the 32-bit space. If
the research experiments do not need all of the standard extensions, a
standard-compatible global encoding might drop the unused standard
extensions and reuse their prefixes to place the proposed extension in a
non-standard location to simplify engineering of the research prototype.
Standard tools will still be able to target the base and any standard
extensions that are present to reduce development time. Once the
instruction-set extension has been evaluated and refined, it could then
be made available for packing into a larger variable-length encoding
space to avoid conflicts with all standard extensions.
The following sections describe increasingly sophisticated strategies
for developing implementations with new instruction-set extensions.
These are mostly intended for use in highly customized, educational, or
experimental architectures rather than for the main line of RISC-V ISA
development.
Extensions within fixed-width 32-bit instruction format

In this section, we discuss adding extensions to implementations that
only support the base fixed-width 32-bit instruction format.

We anticipate the simplest fixed-width 32-bit encoding will be popular
for many restricted accelerators and research prototypes.

Available 30-bit instruction encoding spaces

In the standard encoding, three of the available 30-bit instruction
encoding spaces (those with 2-bit prefixes 00, 01, and 10) are used to
enable the optional compressed instruction extension. However, if the
compressed instruction-set extension is not required, then these three
further 30-bit encoding spaces become available. This quadruples the
available encoding space within the 32-bit format.
Available 25-bit instruction encoding spaces

A 25-bit instruction encoding space corresponds to a major opcode in the
base and standard extension encodings.
There are four major opcodes expressly designated for custom extensions
(Table [opcodemap]), each of which represents a
25-bit encoding space. Two of these are reserved for eventual use in the
RV128 base encoding (will be OP-IMM-64 and OP-64), but can be used for
non-standard extensions for RV32 and RV64.
The two major opcodes reserved for RV64 (OP-IMM-32 and OP-32) can also
be used for non-standard extensions to RV32 only.
If an implementation does not require floating-point, then the seven
major opcodes reserved for standard floating-point extensions (LOAD-FP,
STORE-FP, MADD, MSUB, NMSUB, NMADD, OP-FP) can be reused for
non-standard extensions. Similarly, the AMO major opcode can be reused
if the standard atomic extensions are not required.
If an implementation does not require instructions longer than 32 bits,
then an additional four major opcodes are available (those marked in
gray in Table [opcodemap]).
The base RV32I encoding uses only 11 major opcodes plus 3 reserved
opcodes, leaving up to 18 available for extensions. The base RV64I
encoding uses only 13 major opcodes plus 3 reserved opcodes, leaving up
to 16 available for extensions.
Available 22-bit instruction encoding spaces

A 22-bit encoding space corresponds to a funct3 minor opcode space in
the base and standard extension encodings. Several major opcodes have a
funct3 field minor opcode that is not completely occupied, leaving
available several 22-bit encoding spaces.
Usually a major opcode selects the format used to encode operands in the
remaining bits of the instruction, and ideally, an extension should
follow the operand format of the major opcode to simplify hardware
decoding.
Other spaces

Smaller spaces are available under certain major opcodes, and not all
minor opcodes are entirely filled.
Adding aligned 64-bit instruction extensions

The simplest approach to provide space for extensions that are too large
for the base 32-bit fixed-width instruction format is to add naturally
aligned 64-bit instructions. The implementation must still support the
32-bit base instruction format, but can require that 64-bit instructions
are aligned on 64-bit boundaries to simplify instruction fetch, with a
32-bit NOP instruction used as alignment padding where necessary.
To simplify use of standard tools, the 64-bit instructions should be
encoded as described in
Figure [instlengthcode]. However, an
implementation might choose a non-standard instruction-length encoding
for 64-bit instructions, while retaining the standard encoding for
32-bit instructions. For example, if compressed instructions are not
required, then a 64-bit instruction could be encoded using one or more
zero bits in the first two bits of an instruction.
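
The alignment rule above can be sketched as a small layout pass that inserts a 32-bit NOP whenever a 64-bit instruction would otherwise start on a non-64-bit boundary. This is only an illustration: the function name `layout` and the `(size, label)` representation are assumptions, and the sketch expects a 4-byte-aligned base address and instruction sizes of 4 or 8 bytes only.

```python
def layout(insts, base=0):
    """Assign addresses to (size_bytes, label) pairs, padding 8-byte
    instructions to 8-byte boundaries with a 4-byte NOP."""
    placed = []
    addr = base
    for size, label in insts:
        if size == 8 and addr % 8 != 0:
            # Alignment padding: one 32-bit NOP (e.g. addi x0, x0, 0).
            placed.append((addr, 4, "nop"))
            addr += 4
        placed.append((addr, size, label))
        addr += size
    return placed
```

With this sketch, a 64-bit instruction that follows a single 32-bit instruction at address 0 is placed at address 8, with a NOP at address 4.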

We anticipate processor generators that produce instruction-fetch units
capable of automatically handling any combination of supported
variable-length instruction encodings.

Supporting VLIW encodings

Although RISC-V was not designed as a base for a pure VLIW machine, VLIW
encodings can be added as extensions using several alternative
approaches. In all cases, the base 32-bit encoding has to be supported
to allow use of any standard software tools.
Fixed-size instruction group

The simplest approach is to define a single large naturally aligned
instruction format (e.g., 128 bits) within which VLIW operations are
encoded. In a conventional VLIW, this approach would tend to waste
instruction memory to hold NOPs, but a RISC-V-compatible implementation
would have to also support the base 32-bit instructions, confining the
VLIW code size expansion to VLIW-accelerated functions.
Encoded-Length Groups

Another approach is to use the standard length encoding from
Figure [instlengthcode] to encode parallel
instruction groups, allowing NOPs to be compressed out of the VLIW
instruction. For example, a 64-bit instruction could hold two 28-bit
operations, while a 96-bit instruction could hold three 28-bit
operations, and so on. Alternatively, a 48-bit instruction could hold
one 42-bit operation, while a 96-bit instruction could hold two 42-bit
operations, and so on.
This approach has the advantage of retaining the base ISA encoding for
instructions holding a single operation, but has the disadvantage of
requiring a new 28-bit or 42-bit encoding for operations within the VLIW
instructions, and misaligned instruction fetch for larger groups. One
simplification is to not allow VLIW instructions to straddle certain
microarchitecturally significant boundaries (e.g., cache lines or
virtual memory pages).
Fixed-Size Instruction Bundles

Another approach, similar to Itanium, is to use a larger naturally
aligned fixed instruction bundle size (e.g., 128 bits) across which
parallel operation groups are encoded. This simplifies instruction
fetch, but shifts the complexity to the group execution engine. To
remain RISC-V compatible, the base 32-bit instruction would still have
to be supported.
End-of-Group bits in Prefix

None of the above approaches retains the RISC-V encoding for the
individual operations within a VLIW instruction. Yet another approach is
to repurpose the two prefix bits in the fixed-width 32-bit encoding. One
prefix bit can be used to signal “end-of-group” if set, while the second
bit could indicate execution under a predicate if clear. Standard RISC-V
32-bit instructions generated by tools unaware of the VLIW extension
would have both prefix bits set (11) and thus have the correct
semantics, with each instruction at the end of a group and not
predicated.
The main disadvantage of this approach is that the base ISAs lack the
complex predication support usually required in an aggressive VLIW
system, and it is difficult to add space to specify more predicate
registers in the standard 30-bit encoding space.
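
A minimal sketch of how a fetch unit might interpret the repurposed prefix bits is given below. The text does not say which of the two bits carries which meaning, so the assignment here (bit 0 set means end-of-group, bit 1 clear means predicated) is purely an assumption, as is the function name `split_groups`.

```python
def split_groups(words):
    """Split a stream of 32-bit words into VLIW groups.

    Returns a list of groups; each group is a list of (word, predicated) pairs.
    """
    groups, current = [], []
    for word in words:
        end_of_group = bool(word & 0b01)  # assumed: bit 0 set ends the group
        predicated = not (word & 0b10)    # assumed: bit 1 clear means predicated
        current.append((word, predicated))
        if end_of_group:
            groups.append(current)
            current = []
    if current:
        groups.append(current)  # trailing operations without an end marker
    return groups
```

Under these assumptions, ordinary RISC-V instructions with prefix 11 each form their own unpredicated group, matching the behavior described above.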
ISA Extension Naming Conventions

This chapter describes the RISC-V ISA extension naming scheme that is
used to concisely describe the set of instructions present in a hardware
implementation, or the set of instructions used by an application binary
interface (ABI).

The RISC-V ISA is designed to support a wide variety of implementations
with various experimental instruction-set extensions. We have found that
an organized naming scheme simplifies software tools and documentation.

Case Sensitivity

The ISA naming strings are case insensitive.
Base Integer ISA

RISC-V ISA strings begin with either RV32I, RV32E, RV64I, or RV128I
indicating the supported address space size in bits for the base integer
ISA.
Instruction-Set Extension Names

Standard ISA extensions are given a name consisting of a single letter.
For example, the first four standard extensions to the integer bases
are: “M” for integer multiplication and division, “A” for atomic memory
instructions, “F” for single-precision floating-point instructions, and
“D” for double-precision floating-point instructions. Any RISC-V
instruction-set variant can be succinctly described by concatenating the
base integer prefix with the names of the included extensions, e.g.,
“RV64IMAFD”.
We have also defined an abbreviation “G” to represent the
“IMAFDZicsr_Zifencei” base and extensions, as this is intended to
represent our standard general-purpose ISA.
Standard extensions to the RISC-V ISA are given other reserved letters,
e.g., “Q” for quad-precision floating-point, or “C” for the 16-bit
compressed instruction format.
Some ISA extensions depend on the presence of other extensions, e.g.,
“D” depends on “F” and “F” depends on “Zicsr”. These dependences may be
implicit in the ISA name: for example, RV32IF is equivalent to
RV32IFZicsr, and RV32ID is equivalent to RV32IFD and RV32IFDZicsr.
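
The implicit dependences can be made explicit by computing a transitive closure over a dependency map. The sketch below encodes only the two dependences named in this paragraph; the names `IMPLIES` and `expand` are illustrative, and further entries would be added in the same way.

```python
# Dependences stated in the text: D depends on F, and F depends on Zicsr.
IMPLIES = {"D": {"F"}, "F": {"Zicsr"}}


def expand(extensions):
    """Return the extension set with all implied extensions added."""
    result = set(extensions)
    pending = list(extensions)
    while pending:
        ext = pending.pop()
        for dep in IMPLIES.get(ext, ()):
            if dep not in result:
                result.add(dep)
                pending.append(dep)  # follow chains such as D -> F -> Zicsr
    return result
```

For instance, expanding the extensions of RV32ID yields the same set as expanding those of RV32IFDZicsr.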
Version Numbers

Recognizing that instruction sets may expand or alter over time, we
encode extension version numbers following the extension name. Version
numbers are divided into major and minor version numbers, separated by a
“p”. If the minor version is “0”, then “p0” can be omitted from the
version string. Changes in major version numbers imply a loss of
backwards compatibility, whereas changes in only the minor version
number must be backwards-compatible. For example, the original 64-bit
standard ISA defined in release 1.0 of this manual can be written in
full as “RV64I1p0M1p0A1p0F1p0D1p0”, more concisely as “RV64I1M1A1F1D1”.
We introduced the version numbering scheme with the second release.
Hence, we define the default version of a standard extension to be the
version present at that time, e.g., “RV32I” is equivalent to “RV32I2”.
Underscores

Underscores “_” may be used to separate ISA extensions to improve
readability and to provide disambiguation, e.g., “RV32I2_M2_A2”.
Because the “P” extension for Packed SIMD can be confused for the
decimal point in a version number, it must be preceded by an underscore
if it follows a number. For example, “rv32i2p2” means version 2.2 of
RV32I, whereas “rv32i2_p2” means version 2.0 of RV32I with version 2.0
of the P extension.
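
The interaction between version numbers, the “p” separator, and underscores can be illustrated with a small tokenizer. This is a sketch under simplifying assumptions, not a complete parser: the regular expression and the name `parse_isa` are hypothetical, unspecified versions are reported as `None` rather than replaced by defaults, and no ordering rules are checked.

```python
import re

# One extension token: optional underscore, then either a multi-letter name
# starting with z, x, or s, or a single letter, then an optional version.
TOKEN = re.compile(r"_?(?:([zxs][a-z]+)|([a-wy]))(?:(\d+)(?:p(\d+))?)?")


def parse_isa(isa: str):
    """Return (xlen, [(name, major, minor), ...]) for an ISA naming string."""
    text = isa.lower()  # naming strings are case insensitive
    head = re.match(r"rv(32|64|128)", text)
    if head is None:
        raise ValueError(f"not an ISA string: {isa!r}")
    pos = head.end()
    extensions = []
    while pos < len(text):
        tok = TOKEN.match(text, pos)
        if tok is None:
            raise ValueError(f"cannot parse {isa!r} at offset {pos}")
        name = tok.group(1) or tok.group(2)
        major = int(tok.group(3)) if tok.group(3) else None
        if major is None:
            minor = None
        else:
            minor = int(tok.group(4)) if tok.group(4) else 0  # "p0" may be omitted
        extensions.append((name, major, minor))
        pos = tok.end()
    return int(head.group(1)), extensions
```

Because the version pattern greedily consumes a trailing “p” and digits, `parse_isa("rv32i2p2")` reports version 2.2 of the base, while `parse_isa("rv32i2_p2")` reports version 2.0 of the base followed by version 2.0 of P.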
Additional Standard Extension Names

Standard extensions can also be named using a single “Z” followed by an
alphabetical name and an optional version number. For example,
“Zifencei” names the instruction-fetch fence extension described in
Chapter [chap:zifencei]; “Zifencei2” and
“Zifencei2p0” name version 2.0 of same.
The first letter following the “Z” conventionally indicates the most
closely related alphabetical extension category, IMAFDQCVH. For the
“Zam” extension for misaligned atomics, for example, the letter “a”
indicates the extension is related to the “A” standard extension. If
multiple “Z” extensions are named, they should be ordered first by
category, then alphabetically within a category—for example,
“Zicsr_Zifencei_Zam”.
Extensions with the “Z” prefix must be separated from other multi-letter
extensions by an underscore, e.g., “RV32IMACZicsr_Zifencei”.
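
The ordering rule for multiple “Z” extensions can be expressed as a sort key, as in the sketch below. The category string is the one given above (IMAFDQCVH); the function name `sort_z_extensions` is hypothetical, and names whose second letter is not one of these categories are not handled.

```python
CATEGORY_ORDER = "IMAFDQCVH"  # category letters as listed in the text


def sort_z_extensions(names):
    """Order Z-extension names by category letter, then alphabetically."""
    def key(name):
        lowered = name.lower()
        # The letter after "Z" selects the related standard extension category.
        return (CATEGORY_ORDER.index(lowered[1].upper()), lowered)
    return sorted(names, key=key)
```

Applied to Zam, Zifencei, and Zicsr in any order, this produces Zicsr, Zifencei, Zam, which joined with underscores gives the example string above.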
Supervisor-level Instruction-Set Extensions

Standard supervisor-level instruction-set extensions are defined in
Volume II, but are named using “S” as a prefix, followed by an
alphabetical name and an optional version number. Supervisor-level
extensions must be separated from other multi-letter extensions by an
underscore.
Standard supervisor-level extensions should be listed after standard
unprivileged extensions. If multiple supervisor-level extensions are
listed, they should be ordered alphabetically.
Machine-level Instruction-Set Extensions

Standard machine-level instruction-set extensions are prefixed with the
three letters “Zxm”.
Standard machine-level extensions should be listed after standard
lesser-privileged extensions. If multiple machine-level extensions are
listed, they should be ordered alphabetically.
Non-Standard Extension Names

Non-standard extensions are named using a single “X” followed by an
alphabetical name and an optional version number. For example, “Xhwacha”
names the Hwacha vector-fetch ISA extension; “Xhwacha2” and “Xhwacha2p0”
name version 2.0 of same.
Non-standard extensions must be listed after all standard extensions.
They must be separated from other multi-letter extensions by an
underscore. For example, an ISA with non-standard extensions Argle and
Bargle may be named “RV64IZifencei_Xargle_Xbargle”.
If multiple non-standard extensions are listed, they should be ordered
alphabetically.
Subset Naming Convention

Table 1.1 summarizes the standardized
extension names.  


| Subset | Name | Implies |
|---|---|---|
| **Base ISA** | | |
| Integer | I | |
| Reduced Integer | E | |
| **Standard Unprivileged Extensions** | | |
| Integer Multiplication and Division | M | |
| Atomics | A | |
| Single-Precision Floating-Point | F | Zicsr |
| Double-Precision Floating-Point | D | F |
| General | G | IMAFDZicsr_Zifencei |
| Quad-Precision Floating-Point | Q | D |
| 16-bit Compressed Instructions | C | |
| Packed-SIMD Extensions | P | |
| Vector Extension | V | D |
| Hypervisor Extension | H | |
| Control and Status Register Access | Zicsr | |
| Instruction-Fetch Fence | Zifencei | |
| Misaligned Atomics | Zam | A |
| Total Store Ordering | Ztso | |
| **Standard Supervisor-Level Extensions** | | |
| Supervisor-level extension “def” | Sdef | |
| **Standard Machine-Level Extensions** | | |
| Machine-level extension “jkl” | Zxmjkl | |
| **Non-Standard Extensions** | | |
| Non-standard extension “mno” | Xmno | |


Standard ISA extension names. The table also defines the canonical order
in which extension names must appear in the name string, with
top-to-bottom in table indicating first-to-last in the name string,
e.g., RV32IMACV is legal, whereas RV32IMAVC is not.
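
A check for the canonical order of single-letter names could look like the following sketch. The order string is read off the table top to bottom; the function name `in_canonical_order` is hypothetical, and multi-letter (Z, S, Zxm, X) names are outside its scope.

```python
CANONICAL_ORDER = "IEMAFDGQCPVH"  # single-letter names, top to bottom in the table


def in_canonical_order(letters: str) -> bool:
    """True if the single-letter extension names appear in canonical order."""
    # str.index raises ValueError for letters that are not in the table.
    positions = [CANONICAL_ORDER.index(c) for c in letters.upper()]
    return all(a < b for a, b in zip(positions, positions[1:]))
```

With this sketch, `in_canonical_order("IMACV")` is true and `in_canonical_order("IMAVC")` is false, matching the example in the caption.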

# History and Acknowledgments
“Why Develop a new ISA?” Rationale from Berkeley Group

We developed RISC-V to support our own needs in research and education,
where our group is particularly interested in actual hardware
implementations of research ideas (we have completed eleven different
silicon fabrications of RISC-V since the first edition of this
specification), and in providing real implementations for students to
explore in classes (RISC-V processor RTL designs have been used in
multiple undergraduate and graduate classes at Berkeley). In our current
research, we are especially interested in the move towards specialized
and heterogeneous accelerators, driven by the power constraints imposed
by the end of conventional transistor scaling. We wanted a highly
flexible and extensible base ISA around which to build our research
effort.
A question we have been repeatedly asked is “Why develop a new ISA?” The
biggest obvious benefit of using an existing commercial ISA is the large
and widely supported software ecosystem, both development tools and
ported applications, which can be leveraged in research and teaching.
Other benefits include the existence of large amounts of documentation
and tutorial examples. However, our experience of using commercial
instruction sets for research and teaching is that these benefits are
smaller in practice, and do not outweigh the disadvantages:


Commercial ISAs are proprietary. Except for SPARC V8, which is
an open IEEE standard, most owners of commercial ISAs carefully
guard their intellectual property and do not welcome freely
available competitive implementations. This is much less of an issue
for academic research and teaching using only software simulators,
but has been a major concern for groups wishing to share actual RTL
implementations. It is also a major concern for entities who do not
want to trust the few sources of commercial ISA implementations, but
who are prohibited from creating their own clean room
implementations. We cannot guarantee that all RISC-V implementations
will be free of third-party patent infringements, but we can
guarantee we will not attempt to sue a RISC-V implementor.


Commercial ISAs are only popular in certain market domains. The
most obvious examples at time of writing are that the ARM
architecture is not well supported in the server space, and the
Intel x86 architecture (or for that matter, almost every other
architecture) is not well supported in the mobile space, though both
Intel and ARM are attempting to enter each other’s market segments.
Another example is ARC and Tensilica, which provide extensible cores
but are focused on the embedded space. This market segmentation
dilutes the benefit of supporting a particular commercial ISA as in
practice the software ecosystem only exists for certain domains, and
has to be built for others.


Commercial ISAs come and go. Previous research infrastructures
have been built around commercial ISAs that are no longer popular
(SPARC, MIPS) or even no longer in production (Alpha). These lose
the benefit of an active software ecosystem, and the lingering
intellectual property issues around the ISA and supporting tools
interfere with the ability of interested third parties to continue
supporting the ISA. An open ISA might also lose popularity, but any
interested party can continue using and developing the ecosystem.


Popular commercial ISAs are complex. The dominant commercial
ISAs (x86 and ARM) are both very complex to implement in hardware to
the level of supporting common software stacks and operating
systems. Worse, nearly all the complexity is due to bad, or at least
outdated, ISA design decisions rather than features that truly
improve efficiency.


Commercial ISAs alone are not enough to bring up applications.
Even if we expend the effort to implement a commercial ISA, this is
not enough to run existing applications for that ISA. Most
applications need a complete ABI (application binary interface) to
run, not just the user-level ISA. Most ABIs rely on libraries, which
in turn rely on operating system support. To run an existing
operating system requires implementing the supervisor-level ISA and
device interfaces expected by the OS. These are usually much less
well-specified and considerably more complex to implement than the
user-level ISA.


Popular commercial ISAs were not designed for extensibility. The
dominant commercial ISAs were not particularly designed for
extensibility, and as a consequence have added considerable
instruction encoding complexity as their instruction sets have
grown. Companies such as Tensilica (acquired by Cadence) and ARC
(acquired by Synopsys) have built ISAs and toolchains around
extensibility, but have focused on embedded applications rather than
general-purpose computing systems.


A modified commercial ISA is a new ISA. One of our main goals is
to support architecture research, including major ISA extensions.
Even small extensions diminish the benefit of using a standard ISA,
as compilers have to be modified and applications rebuilt from
source code to use the extension. Larger extensions that introduce
new architectural state also require modifications to the operating
system. Ultimately, the modified commercial ISA becomes a new ISA,
but carries along all the legacy baggage of the base ISA.


Our position is that the ISA is perhaps the most important interface in
a computing system, and there is no reason that such an important
interface should be proprietary. The dominant commercial ISAs are based
on instruction-set concepts that were already well known over 30 years
ago. Software developers should be able to target an open standard
hardware target, and commercial processor designers should compete on
implementation quality.
We are far from the first to contemplate an open ISA design suitable for
hardware implementation. We also considered other existing open ISA
designs, of which the closest to our goals was the OpenRISC
architecture. We decided against adopting the OpenRISC ISA for several
technical reasons:


OpenRISC has condition codes and branch delay slots, which
complicate higher performance implementations.


OpenRISC uses a fixed 32-bit encoding and 16-bit immediates, which
precludes a denser instruction encoding and limits space for later
expansion of the ISA.


OpenRISC does not support the 2008 revision to the IEEE 754
floating-point standard.


The OpenRISC 64-bit design had not been completed when we began.


By starting from a clean slate, we could design an ISA that met all of
our goals, though of course, this took far more effort than we had
planned at the outset. We have now invested considerable effort in
building up the RISC-V ISA infrastructure, including documentation,
compiler tool chains, operating system ports, reference ISA simulators,
FPGA implementations, efficient ASIC implementations, architecture test
suites, and teaching materials. Since the last edition of this manual,
there has been considerable uptake of the RISC-V ISA in both academia
and industry, and we have created the non-profit RISC-V Foundation to
protect and promote the standard. The RISC-V Foundation website at
https://riscv.org contains the latest information on the Foundation
membership and various open-source projects using RISC-V.
History from Revision 1.0 of ISA manual

The RISC-V ISA and instruction-set manual builds upon several earlier
projects. Several aspects of the supervisor-level machine and the
overall format of the manual date back to the T0 (Torrent-0) vector
microprocessor project at UC Berkeley and ICSI, begun in 1992. T0 was a
vector processor based on the MIPS-II ISA, with Krste Asanović as main
architect and RTL designer, and Brian Kingsbury and Bertrand Irrisou as
principal VLSI implementors. David Johnson at ICSI was a major
contributor to the T0 ISA design, particularly supervisor mode, and to
the manual text. John Hauser also provided considerable feedback on the
T0 ISA design.
The Scale (Software-Controlled Architecture for Low Energy) project at
MIT, begun in 2000, built upon the T0 project infrastructure, refined
the supervisor-level interface, and moved away from the MIPS scalar ISA
by dropping the branch delay slot. Ronny Krashinsky and Christopher
Batten were the principal architects of the Scale Vector-Thread
processor at MIT, while Mark Hampton ported the GCC-based compiler
infrastructure and tools for Scale.
A lightly edited version of the T0 MIPS scalar processor specification
(MIPS-6371) was used in teaching a new version of the MIT 6.371
Introduction to VLSI Systems class in the Fall 2002 semester, with Chris
Terman and Krste Asanović as lecturers. Chris Terman contributed most of
the lab material for the class (there was no TA!). The 6.371 class
evolved into the trial 6.884 Complex Digital Design class at MIT, taught
by Arvind and Krste Asanović in Spring 2005, which became a regular
Spring class 6.375. A reduced version of the Scale MIPS-based scalar
ISA, named SMIPS, was used in 6.884/6.375. Christopher Batten was the TA
for the early offerings of these classes and developed a considerable
amount of documentation and lab material based around the SMIPS ISA.
This same SMIPS lab material was adapted and enhanced by TA Yunsup Lee
for the UC Berkeley Fall 2009 CS250 VLSI Systems Design class taught by
John Wawrzynek, Krste Asanović, and John Lazzaro.
The Maven (Malleable Array of Vector-thread ENgines) project was a
second-generation vector-thread architecture. Its design was led by
Christopher Batten when he was an Exchange Scholar at UC Berkeley
starting in summer 2007. Hidetaka Aoki, a visiting industrial fellow
from Hitachi, gave considerable feedback on the early Maven ISA and
microarchitecture design. The Maven infrastructure was based on the
Scale infrastructure but the Maven ISA moved further away from the MIPS
ISA variant defined in Scale, with a unified floating-point and integer
register file. Maven was designed to support experimentation with
alternative data-parallel accelerators. Yunsup Lee was the main
implementor of the various Maven vector units, while Rimas Avižienis was
the main implementor of the various Maven scalar units. Yunsup Lee and
Christopher Batten ported GCC to work with the new Maven ISA.
Christopher Celio provided the initial definition of a traditional
vector instruction set (“Flood”) variant of Maven.
Based on experience with all these previous projects, the RISC-V ISA
definition was begun in Summer 2010, with Andrew Waterman, Yunsup Lee,
Krste Asanović, and David Patterson as principal designers. An initial
version of the RISC-V 32-bit instruction subset was used in the UC
Berkeley Fall 2010 CS250 VLSI Systems Design class, with Yunsup Lee as
TA. RISC-V is a clean break from the earlier MIPS-inspired designs. John
Hauser contributed to the floating-point ISA definition, including the
sign-injection instructions and a register encoding scheme that permits
internal recoding of floating-point values.
History from Revision 2.0 of ISA manual

Multiple implementations of RISC-V processors have been completed,
including several silicon fabrications, as shown in
Figure [silicon].


| Name | Tapeout Date | Process | ISA |
|---|---|---|---|
| Raven-1 | May 29, 2011 | ST 28nm FDSOI | RV64G1_Xhwacha1 |
| EOS14 | April 1, 2012 | IBM 45nm SOI | RV64G1p1_Xhwacha2 |
| EOS16 | August 17, 2012 | IBM 45nm SOI | RV64G1p1_Xhwacha2 |
| Raven-2 | August 22, 2012 | ST 28nm FDSOI | RV64G1p1_Xhwacha2 |
| EOS18 | February 6, 2013 | IBM 45nm SOI | RV64G1p1_Xhwacha2 |
| EOS20 | July 3, 2013 | IBM 45nm SOI | RV64G1p99_Xhwacha2 |
| Raven-3 | September 26, 2013 | ST 28nm SOI | RV64G1p99_Xhwacha2 |
| EOS22 | March 7, 2014 | IBM 45nm SOI | RV64G1p9999_Xhwacha3 |


The first RISC-V processors to be fabricated were written in Verilog and
manufactured in a pre-production FDSOI technology from ST as the Raven-1
testchip in 2011. Two cores were developed by Yunsup Lee and Andrew
Waterman, advised by Krste Asanović, and fabricated together: 1) an RV64
scalar core with error-detecting flip-flops, and 2) an RV64 core with an
attached 64-bit floating-point vector unit. The first microarchitecture
was informally known as “TrainWreck”, due to the short time available to
complete the design with immature design libraries.
Subsequently, a clean microarchitecture for an in-order decoupled RV64
core was developed by Andrew Waterman, Rimas Avižienis, and Yunsup Lee,
advised by Krste Asanović, and, continuing the railway theme, was
codenamed “Rocket” after George Stephenson’s successful steam locomotive
design. Rocket was written in Chisel, a new hardware design language
developed at UC Berkeley. The IEEE floating-point units used in Rocket
were developed by John Hauser, Andrew Waterman, and Brian Richards.
Rocket has since been refined and developed further, and has been
fabricated two more times in FDSOI (Raven-2, Raven-3), and five times in
IBM SOI technology (EOS14, EOS16, EOS18, EOS20, EOS22) for a photonics
project. Work is ongoing to make the Rocket design available as a
parameterized RISC-V processor generator.
EOS14–EOS22 chips include early versions of Hwacha, a 64-bit IEEE
floating-point vector unit, developed by Yunsup Lee, Andrew Waterman,
Huy Vo, Albert Ou, Quan Nguyen, and Stephen Twigg, advised by Krste
Asanović. EOS16–EOS22 chips include dual cores with a cache-coherence
protocol developed by Henry Cook and Andrew Waterman, advised by Krste
Asanović. EOS14 silicon has successfully run at . EOS16 silicon suffered
from a bug in the IBM pad libraries. EOS18 and EOS20 have successfully
run at .
Contributors to the Raven testchips include Yunsup Lee, Andrew Waterman,
Rimas Avižienis, Brian Zimmer, Jaehwa Kwak, Ruzica Jevtić, Milovan
Blagojević, Alberto Puggelli, Steven Bailey, Ben Keller, Pi-Feng Chiu,
Brian Richards, Borivoje Nikolić, and Krste Asanović.
Contributors to the EOS testchips include Yunsup Lee, Rimas Avižienis,
Andrew Waterman, Henry Cook, Huy Vo, Daiwei Li, Chen Sun, Albert Ou,
Quan Nguyen, Stephen Twigg, Vladimir Stojanović, and Krste Asanović.
Andrew Waterman and Yunsup Lee developed the C++ ISA simulator “Spike”,
used as a golden model in development and named after the golden spike
used to celebrate completion of the US transcontinental railway. Spike
has been made available as a BSD open-source project.
Andrew Waterman completed a Master’s thesis with a preliminary design of
the RISC-V compressed instruction set.
Various FPGA implementations of the RISC-V have been completed,
primarily as part of integrated demos for the Par Lab project research
retreats. The largest FPGA design has 3 cache-coherent RV64IMA
processors running a research operating system. Contributors to the FPGA
implementations include Andrew Waterman, Yunsup Lee, Rimas Avižienis,
and Krste Asanović.
RISC-V processors have been used in several classes at UC Berkeley.
Rocket was used in the Fall 2011 offering of CS250 as a basis for class
projects, with Brian Zimmer as TA. For the undergraduate CS152 class in
Spring 2012, Christopher Celio used Chisel to write a suite of
educational RV32 processors, named “Sodor” after the island on which
“Thomas the Tank Engine” and friends live. The suite includes a
microcoded core, an unpipelined core, and 2, 3, and 5-stage pipelined
cores, and is publicly available under a BSD license. The suite was
subsequently updated and used again in CS152 in Spring 2013, with Yunsup
Lee as TA, and in Spring 2014, with Eric Love as TA. Christopher Celio
also developed an out-of-order RV64 design known as BOOM (Berkeley
Out-of-Order Machine), with accompanying pipeline visualizations, that
was used in the CS152 classes. The CS152 classes also used
cache-coherent versions of the Rocket core developed by Andrew Waterman
and Henry Cook.
Over the summer of 2013, the RoCC (Rocket Custom Coprocessor) interface
was defined to simplify adding custom accelerators to the Rocket core.
Rocket and the RoCC interface were used extensively in the Fall 2013
CS250 VLSI class taught by Jonathan Bachrach, with several student
accelerator projects built to the RoCC interface. The Hwacha vector unit
has been rewritten as a RoCC coprocessor.
Two Berkeley undergraduates, Quan Nguyen and Albert Ou, successfully
ported Linux to run on RISC-V in Spring 2013.
Colin Schmidt successfully completed an LLVM backend for RISC-V 2.0 in
January 2014.
Darius Rad at Bluespec contributed soft-float ABI support to the GCC
port in March 2014.
John Hauser contributed the definition of the floating-point
classification instructions.
We are aware of several other RISC-V core implementations, including one
in Verilog by Tommy Thorn, and one in Bluespec by Rishiyur Nikhil.
Acknowledgments

Thanks to Christopher F. Batten, Preston Briggs, Christopher Celio,
David Chisnall, Stefan Freudenberger, John Hauser, Ben Keller, Rishiyur
Nikhil, Michael Taylor, Tommy Thorn, and Robert Watson for comments on
the draft ISA version 2.0 specification.
History from Revision 2.1

Uptake of the RISC-V ISA has been very rapid since the introduction of
the frozen version 2.0 in May 2014, with too much activity to record in
a short history section such as this. Perhaps the most important single
event was the formation of the non-profit RISC-V Foundation in August
2015. The Foundation will now take over stewardship of the official
RISC-V ISA standard, and the official website riscv.org is the best
place to obtain news and updates on the RISC-V standard.
Acknowledgments

Thanks to Scott Beamer, Allen J. Baum, Christopher Celio, David
Chisnall, Paul Clayton, Palmer Dabbelt, Jan Gray, Michael Hamburg, and
John Hauser for comments on the version 2.0 specification.
History from Revision 2.2

Acknowledgments

Thanks to Jacob Bachmeyer, Alex Bradbury, David Horner, Stefan O’Rear,
and Joseph Myers for comments on the version 2.1 specification.
History for Revision 2.3

Uptake of RISC-V continues at breakneck pace.
John Hauser and Andrew Waterman contributed a hypervisor ISA extension
based upon a proposal from Paolo Bonzini.
Daniel Lustig, Arvind, Krste Asanović, Shaked Flur, Paul Loewenstein,
Yatin Manerkar, Luc Maranget, Margaret Martonosi, Vijayanand Nagarajan,
Rishiyur Nikhil, Jonas Oberhauser, Christopher Pulte, Jose Renau, Peter
Sewell, Susmit Sarkar, Caroline Trippel, Muralidaran Vijayaraghavan,
Andrew Waterman, Derek Williams, Andrew Wright, and Sizhuo Zhang
contributed the memory consistency model.
Funding

Development of the RISC-V architecture and implementations has been
partially funded by the following sponsors.


Par Lab: Research supported by Microsoft (Award #024263) and
Intel (Award #024894) funding and by matching funding by U.C.
Discovery (Award #DIG07-10227). Additional support came from Par
Lab affiliates Nokia, NVIDIA, Oracle, and Samsung.


Project Isis: DoE Award DE-SC0003624.


ASPIRE Lab: DARPA PERFECT program, Award HR0011-12-2-0016. DARPA
POEM program Award HR0011-11-C-0100. The Center for Future
Architectures Research (C-FAR), a STARnet center funded by the
Semiconductor Research Corporation. Additional support from ASPIRE
industrial sponsor, Intel, and ASPIRE affiliates, Google, Hewlett
Packard Enterprise, Huawei, Nokia, NVIDIA, Oracle, and Samsung.


The content of this paper does not necessarily reflect the position or
the policy of the US government and no official endorsement should be
inferred.
RVWMO Explanatory Material, Version 0.1

This section provides more explanation for RVWMO
(Chapter [ch:memorymodel]), using more
informal language and concrete examples. These are intended to clarify
the meaning and intent of the axioms and preserved program order rules.
This appendix should be treated as commentary; all normative material is
provided in Chapter [ch:memorymodel] and in the rest of
the main body of the ISA specification. All currently known
discrepancies are listed in
Section 1.7. Any other
discrepancies are unintentional.
Why RVWMO?

Memory consistency models fall along a loose spectrum from weak to
strong. Weak memory models allow more hardware implementation
flexibility and deliver arguably better performance, performance per
watt, power, scalability, and hardware verification overheads than
strong models, at the expense of a more complex programming model.
Strong models provide simpler programming models, but at the cost of
imposing more restrictions on the kinds of (non-speculative) hardware
optimizations that can be performed in the pipeline and in the memory
system, and in turn imposing some cost in terms of power, area overhead,
and verification burden.
RISC-V has chosen the RVWMO memory model, a variant of release
consistency. This places it in between the two extremes of the memory
model spectrum. The RVWMO memory model enables architects to build
simple implementations, aggressive implementations, implementations
embedded deeply inside a much larger system and subject to complex
memory system interactions, or any number of other possibilities, all
while simultaneously being strong enough to support programming language
memory models at high performance.
To facilitate the porting of code from other architectures, some
hardware implementations may choose to implement the Ztso extension,
which provides stricter RVTSO ordering semantics by default. Code
written for RVWMO is automatically and inherently compatible with RVTSO,
but code written assuming RVTSO is not guaranteed to run correctly on
RVWMO implementations. In fact, most RVWMO implementations will (and
should) simply refuse to run RVTSO-only binaries. Each implementation
must therefore choose whether to prioritize compatibility with RVTSO
code (e.g., to facilitate porting from x86) or whether to instead
prioritize compatibility with other RISC-V cores implementing RVWMO.
Some fences and/or memory ordering annotations in code written for RVWMO
may become redundant under RVTSO; the cost that the default of RVWMO
imposes on Ztso implementations is the incremental overhead of fetching
those fences (e.g., FENCE R,RW and FENCE RW,W) which become no-ops on
that implementation. However, these fences must remain present in the
code if compatibility with non-Ztso implementations is desired.
Litmus Tests

The explanations in this chapter make use of litmus tests, or small
programs designed to test or highlight one particular aspect of a memory
model. Figure [fig:litmus:sample] shows an
example of a litmus test with two harts. As a convention for this figure
and for all figures that follow in this chapter, we assume that
s0–s2 are pre-set to the same value in all harts and that s0 holds
the address labeled x, s1 holds y, and s2 holds z, where x,
y, and z are disjoint memory locations aligned to 8 byte boundaries.
Each figure shows the litmus test code on the left, and a visualization
of one particular valid or invalid execution on the right.

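| | Hart 0 | | Hart 1 |
|---|---|---|---|
| | ⋮ | | ⋮ |
| | li t1,1 | | li t4,4 |
| (a) | sw t1,0(s0) | (e) | sw t4,0(s0) |
| | ⋮ | | ⋮ |
| | li t2,2 | | |
| (b) | sw t2,0(s0) | | |
| | ⋮ | | ⋮ |
| (c) | lw a0,0(s0) | | |
| | ⋮ | | ⋮ |
| | li t3,3 | | li t5,5 |
| (d) | sw t3,0(s0) | (f) | sw t5,0(s0) |
| | ⋮ | | ⋮ |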

Litmus tests are used to understand the implications of the memory model
in specific concrete situations. For example, in the litmus test of
Figure [fig:litmus:sample], the final
value of a0 in the first hart can be either 2, 4, or 5, depending on
the dynamic interleaving of the instruction stream from each hart at
runtime. However, in this example, the final value of a0 in Hart 0
will never be 1 or 3; intuitively, the value 1 will no longer be visible
at the time the load executes, and the value 3 will not yet be visible
by the time the load executes. We analyze this test and many others
below.
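As a rough illustration of the claim above, the following Python sketch enumerates every interleaving of the two harts' accesses and records what the load (c) can return. It assumes, purely for illustration, that all accesses to the single location x take effect in one interleaved order that respects each hart's program order, and that x initially holds 0. This is a simplification that happens to suffice for this single-address test; it is not the RVWMO definition. The names `HART0`, `HART1`, `interleavings`, and `possible_a0` are hypothetical.

```python
from itertools import combinations

# Each access is (label, kind, value); every access targets the same address x.
HART0 = [("a", "store", 1), ("b", "store", 2), ("c", "load", None), ("d", "store", 3)]
HART1 = [("e", "store", 4), ("f", "store", 5)]


def interleavings(p, q):
    """Yield every merge of p and q that keeps each list's internal order."""
    total = len(p) + len(q)
    for slots in combinations(range(total), len(p)):
        merged, p_iter, q_iter = [], iter(p), iter(q)
        for i in range(total):
            merged.append(next(p_iter) if i in slots else next(q_iter))
        yield merged


def possible_a0():
    """Collect the values that load (c) can return over all interleavings."""
    results = set()
    for order in interleavings(HART0, HART1):
        memory = 0  # assumed initial value of x
        for _label, kind, value in order:
            if kind == "store":
                memory = value
            else:
                results.add(memory)
    return results


print(sorted(possible_a0()))  # [2, 4, 5]
```

The values 1 and 3 never appear because store (b) always sits between (a) and (c), and store (d) always follows (c).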


| Edge | Full Name (and explanation) |
|---|---|
| rf | Reads From (from each store to the loads that return a value written by that store) |
| co | Coherence (a total order on the stores to each address) |
| fr | From-Reads (from each load to co-successors of the store from which the load returned a value) |
| ppo | Preserved Program Order |
| fence | Orderings enforced by a FENCE instruction |
| addr | Address Dependency |
| ctrl | Control Dependency |
| data | Data Dependency |

A key for the litmus test diagrams drawn in this appendix

The diagram shown to the right of each litmus test shows a visual
representation of the particular execution candidate being considered.
These diagrams use a notation that is common in the memory model
literature for constraining the set of possible global memory orders
that could produce the execution in question. It is also the basis for
the herd models presented in
Appendix [sec:herd]. This notation is explained in
Table 1.1. Of the listed relations, rf
edges between harts, co edges, fr edges, and ppo edges directly
constrain the global memory order (as do fence, addr, data, and some
ctrl edges, via ppo). Other edges (such as intra-hart rf edges) are
informative but do not constrain the global memory order.
For example, in
Figure [fig:litmus:sample], a0=1 could
occur only if (c) reads the value written by (a) and one of the
following were true:


(b) appears before (a) in global memory order (and in the
coherence order co). However, this violates RVWMO PPO
rule [ppo:->st]. The co edge from (b)
to (a) highlights this contradiction.


(a) appears before (b) in global memory order (and in the
coherence order co). However, in this case, the Load Value Axiom
would be violated, because (a) is not the latest matching store
prior to (c) in program order. The fr edge from (c) to (b)
highlights this contradiction.


Since neither of these scenarios satisfies the RVWMO axioms, the outcome
a0=1 is forbidden.
Beyond what is described in this appendix, a suite of more than seven
thousand litmus tests is available at
https://github.com/litmus-tests/litmus-tests-riscv.

The litmus tests repository also provides instructions on how to run the
litmus tests on RISC-V hardware and how to compare the results with the
operational and axiomatic models.


In the future, we expect to adapt these memory model litmus tests for
use as part of the RISC-V compliance test suite as well.

Explaining the RVWMO Rules

In this section, we provide explanation and examples for all of the
RVWMO rules and axioms.
Preserved Program Order and Global Memory Order

Preserved program order represents the subset of program order that must
be respected within the global memory order. Conceptually, events from
the same hart that are ordered by preserved program order must appear in
that order from the perspective of other harts and/or observers. Events
from the same hart that are not ordered by preserved program order, on
the other hand, may appear reordered from the perspective of other harts
and/or observers.
Informally, the global memory order represents the order in which loads
and stores perform. The formal memory model literature has moved away
from specifications built around the concept of performing, but the idea
is still useful for building up informal intuition. A load is said to
have performed when its return value is determined. A store is said to
have performed not when it has executed inside the pipeline, but rather
only when its value has been propagated to globally visible memory. In
this sense, the global memory order also represents the contribution of
the coherence protocol and/or the rest of the memory system to
interleave the (possibly reordered) memory accesses being issued by each
hart into a single total order agreed upon by all harts.
The order in which loads perform does not always directly correspond to
the relative age of the values those loads return. In particular, a
load b may perform before another load a to the same address (i.e.,
b may execute before a, and b may appear before a in the global
memory order), but a may nevertheless return an older value than b.
This discrepancy captures (among other things) the reordering effects of
buffering placed between the core and memory. For example, b may have
returned a value from a store in the store buffer, while a may have
ignored that younger store and read an older value from memory instead.
To account for this, at the time each load performs, the value it
returns is determined by the load value axiom, not just strictly by
determining the most recent store to the same address in the global
memory order, as described below.


Load Value Axiom

Preserved program order is not required to respect the ordering of a
store followed by a load to an overlapping address. This complexity
arises due to the ubiquity of store buffers in nearly all
implementations. Informally, the load may perform (return a value) by
forwarding from the store while the store is still in the store buffer,
and hence before the store itself performs (writes back to globally
visible memory). Any other hart will therefore observe the load as
performing before the store.

| | Hart 0 | | Hart 1 |
|---|---|---|---|
| | li t1, 1 | | li t1, 1 |
| (a) | sw t1,0(s0) | (e) | sw t1,0(s1) |
| (b) | lw a0,0(s0) | (f) | lw a2,0(s1) |
| (c) | fence r,r | (g) | fence r,r |
| (d) | lw a1,0(s1) | (h) | lw a3,0(s0) |

Outcome: a0=1, a1=0, a2=1, a3=0

Consider the litmus test of
Figure [fig:litmus:storebuffer].
When running this program on an implementation with store buffers, it is
possible to arrive at the final outcome a0=1, a1=0, a2=1, a3=0
as follows:


(a) executes and enters the first hart’s private store buffer


(b) executes and forwards its return value 1 from (a) in the store
buffer


(c) executes since all previous loads (i.e., (b)) have completed


(d) executes and reads the value 0 from memory


(e) executes and enters the second hart’s private store buffer


(f) executes and forwards its return value 1 from (e) in the store
buffer


(g) executes since all previous loads (i.e., (f)) have completed


(h) executes and reads the value 0 from memory


(a) drains from the first hart’s store buffer to memory


(e) drains from the second hart’s store buffer to memory
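The sequence of steps above can be replayed with a small operational sketch. The `Hart` class below is a hypothetical toy model, not a description of any real implementation: each hart has a private FIFO store buffer, loads forward from the youngest matching buffered store, and stores become globally visible only when drained. Locations x and y are assumed to start at 0.

```python
class Hart:
    """A hart with a private FIFO store buffer (illustrative model only)."""

    def __init__(self, memory):
        self.memory = memory    # shared dict: address -> value
        self.store_buffer = []  # (address, value) pairs, oldest first
        self.regs = {}

    def store(self, addr, value):
        # The store executes but is not yet visible to other harts.
        self.store_buffer.append((addr, value))

    def load(self, reg, addr):
        # Forward from the youngest buffered store to the same address, if any.
        for buffered_addr, value in reversed(self.store_buffer):
            if buffered_addr == addr:
                self.regs[reg] = value
                return
        self.regs[reg] = self.memory[addr]

    def drain(self):
        # The oldest buffered store becomes globally visible.
        addr, value = self.store_buffer.pop(0)
        self.memory[addr] = value


def run_store_buffer_test():
    memory = {"x": 0, "y": 0}
    h0, h1 = Hart(memory), Hart(memory)
    h0.store("x", 1)    # (a) enters Hart 0's store buffer
    h0.load("a0", "x")  # (b) forwards 1 from (a)
    # (c) fence r,r: loads already execute in program order in this model
    h0.load("a1", "y")  # (d) reads 0 from memory
    h1.store("y", 1)    # (e) enters Hart 1's store buffer
    h1.load("a2", "y")  # (f) forwards 1 from (e)
    # (g) fence r,r: as for (c)
    h1.load("a3", "x")  # (h) reads 0; (a) has not drained yet
    h0.drain()          # (a) drains to memory
    h1.drain()          # (e) drains to memory
    return {**h0.regs, **h1.regs}, memory


regs, final_memory = run_store_buffer_test()
print(regs)  # {'a0': 1, 'a1': 0, 'a2': 1, 'a3': 0}
```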


Therefore, the memory model must be able to account for this behavior.
To put it another way, suppose the definition of preserved program order
did include the following hypothetical rule: memory access a precedes
memory access b in preserved program order (and hence also in the
global memory order) if a precedes b in program order and a and
b are accesses to the same memory location, a is a write, and b is
a read. Call this “Rule X”. Then we get the following:


(a) precedes (b): by rule X


(b) precedes (d): by rule
[ppo:fence]


(d) precedes (e): by the load value axiom. Otherwise, if (e)
preceded (d), then (d) would be required to return the value 1.
(This is a perfectly legal execution; it’s just not the one in
question)


(e) precedes (f): by rule X


(f) precedes (h): by rule
[ppo:fence]


(h) precedes (a): by the load value axiom, as above.


The global memory order must be a total order and cannot be cyclic,
because a cycle would imply that every event in the cycle happens before
itself, which is impossible. Therefore, the execution proposed above
would be forbidden, and hence the addition of rule X would forbid
implementations with store buffer forwarding, which would clearly be
undesirable.
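The cycle argument can be checked mechanically. The sketch below treats each "precedes" step listed above as a directed edge and runs a depth-first search for a cycle. The function name `has_cycle` and the grouping of edges into three lists are illustrative choices, not part of the specification.

```python
def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs has a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:
                return True  # back edge: nxt is already on the stack
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


FENCE_EDGES = [("b", "d"), ("f", "h")]       # rule [ppo:fence]
LOAD_VALUE_EDGES = [("d", "e"), ("h", "a")]  # load value axiom
RULE_X_EDGES = [("a", "b"), ("e", "f")]      # hypothetical Rule X

print(has_cycle(FENCE_EDGES + LOAD_VALUE_EDGES + RULE_X_EDGES))  # True
print(has_cycle(FENCE_EDGES + LOAD_VALUE_EDGES))                 # False
```

With the Rule X edges, the graph contains the cycle (a), (b), (d), (e), (f), (h), (a), so no total order exists. Without them the remaining edges are acyclic and the execution can be placed in a global memory order.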
Nevertheless, even if (b) precedes (a) and/or (f) precedes (e) in the
global memory order, the only sensible possibility in this example is
for (b) to return the value written by (a), and likewise for (f) and
(e). This combination of circumstances is what leads to the second
option in the definition of the load value axiom. Even though (b)
precedes (a) in the global memory order, (a) will still be visible to
(b) by virtue of sitting in the store buffer at the time (b) executes.
Therefore, even if (b) precedes (a) in the global memory order, (b)
should return the value written by (a) because (a) precedes (b) in
program order. Likewise for (e) and (f).
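The two options of the load value axiom can be sketched as a small function. This is a simplified whole-word reading that ignores byte granularity; the normative definition is in Chapter [ch:memorymodel]. The data layout (`OPS`, `GMO`) and the function name `load_value` are hypothetical, and memory is assumed to start at 0.

```python
def load_value(load, gmo, ops, initial=0):
    """Value returned by `load` under a whole-word reading of the load value axiom.

    gmo: list of operation labels in global memory order.
    ops: label -> dict(hart, po, kind, addr, value), where `po` is the
         operation's position in its own hart's program order.
    """
    pos = {label: i for i, label in enumerate(gmo)}
    me = ops[load]
    candidates = [
        label
        for label, op in ops.items()
        if op["kind"] == "store"
        and op["addr"] == me["addr"]
        and (
            pos[label] < pos[load]  # option 1: earlier in global memory order
            or (op["hart"] == me["hart"] and op["po"] < me["po"])  # option 2: earlier in program order
        )
    ]
    if not candidates:
        return initial
    latest = max(candidates, key=lambda label: pos[label])
    return ops[latest]["value"]


# The store buffer litmus test; fences are omitted because they are not memory accesses.
OPS = {
    "a": dict(hart=0, po=0, kind="store", addr="x", value=1),
    "b": dict(hart=0, po=1, kind="load", addr="x", value=None),
    "d": dict(hart=0, po=2, kind="load", addr="y", value=None),
    "e": dict(hart=1, po=0, kind="store", addr="y", value=1),
    "f": dict(hart=1, po=1, kind="load", addr="y", value=None),
    "h": dict(hart=1, po=2, kind="load", addr="x", value=None),
}
GMO = ["b", "d", "f", "h", "a", "e"]  # both stores drain after all four loads

outcome = {
    reg: load_value(label, GMO, OPS)
    for reg, label in [("a0", "b"), ("a1", "d"), ("a2", "f"), ("a3", "h")]
}
print(outcome)  # {'a0': 1, 'a1': 0, 'a2': 1, 'a3': 0}
```

Load (b) sees store (a) only through option 2, since (a) follows (b) in this global memory order.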

| | Hart 0 | | Hart 1 |
|---|---|---|---|
| | li t1, 1 | | li t1, 1 |
| (a) | sw t1,0(s0) | | LOOP: |
| (b) | fence w,w | (d) | lw a0,0(s1) |
| (c) | sw t1,0(s1) | | beqz a0, LOOP |
| | | (e) | sw t1,0(s2) |
| | | (f) | lw a1,0(s2) |
| | | | xor a2,a1,a1 |
| | | | add s0,s0,a2 |
| | | (g) | lw a2,0(s0) |

Outcome: a0=1, a1=1, a2=0

Another test that highlights the behavior of store buffers is shown in
Figure [fig:litmus:ppoca]. In this
example, (d) is ordered before (e) because of the control dependency,
and (f) is ordered before (g) because of the address dependency.
However, (e) is not necessarily ordered before (f), even though (f)
returns the value written by (e). This could correspond to the following
sequence of events:


(e) executes speculatively and enters the second hart’s private
store buffer (but does not drain to memory)


(f) executes speculatively and forwards its return value 1
from (e) in the store buffer


(g) executes speculatively and reads the value 0 from memory


(a) executes, enters the first hart’s private store buffer, and
drains to memory


(b) executes and retires


(c) executes, enters the first hart’s private store buffer, and
drains to memory


(d) executes and reads the value 1 from memory


(e), (f), and (g) commit, since the speculation turned out to be
correct


(e) drains from the store buffer to memory


Atomicity Axiom (for Aligned Atomics)

The RISC-V architecture decouples the notion of atomicity from the
notion of ordering. Unlike architectures such as TSO, RISC-V atomics
under RVWMO do not impose any ordering requirements by default. Ordering
semantics are only guaranteed by the PPO rules that otherwise apply.
RISC-V contains two types of atomics: AMOs and LR/SC pairs. These
conceptually behave differently, in the following way. LR/SC behave as
if the old value is brought up to the core, modified, and written back
to memory, all while a reservation is held on that memory location. AMOs
on the other hand conceptually behave as if they are performed directly
in memory. AMOs are therefore inherently atomic, while LR/SC pairs are
atomic in the slightly different sense that the memory location in
question will not be modified by another hart during the time the
original hart holds the reservation.

| (a) | (b) | (c) |
|---|---|---|
| lr.d a0, 0(s0) | sd t1, 0(s0) | sc.d t2, 0(s0) |
| lr.d a0, 0(s0) | sw t1, 4(s0) | sc.d t2, 0(s0) |
| lr.w a0, 0(s0) | sw t1, 4(s0) | sc.w t2, 0(s0) |
| lr.w a0, 0(s0) | sw t1, 4(s0) | sc.w t2, 8(s0) |

The atomicity axiom forbids stores from other harts from being
interleaved in global memory order between an LR and the SC paired with
that LR. The atomicity axiom does not forbid loads from being
interleaved between the paired operations in program order or in the
global memory order, nor does it forbid stores from the same hart or
stores to non-overlapping locations from appearing between the paired
operations in either program order or in the global memory order. For
example, the SC instructions in
Figure [fig:litmus:lrsdsc] may (but are
not guaranteed to) succeed. None of those successes would violate the
atomicity axiom, because the intervening non-conditional stores are from
the same hart as the paired load-reserved and store-conditional
instructions. This way, a memory system that tracks memory accesses at
cache line granularity (and which therefore will see the four snippets
of Figure [fig:litmus:lrsdsc] as identical)
will not be forced to fail a store-conditional instruction that happens
to (falsely) share another portion of the same cache line as the memory
location being held by the reservation.
The atomicity axiom also technically supports cases in which the LR and
SC touch different addresses and/or use different access sizes; however,
use cases for such behaviors are expected to be rare in practice.
Likewise, scenarios in which stores from the same hart between an LR/SC
pair actually overlap the memory location(s) referenced by the LR or SC
are expected to be rare compared to scenarios where the intervening
store may simply fall onto the same cache line.
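A simplified check in the spirit of the atomicity axiom, as described above, might look like the following. It assumes the LR reads all of its bytes from a single store (or from initial memory when `read_from` is `None`), and it models each access as a set of byte offsets. The normative statement is in Chapter [ch:memorymodel]; the function `violates_atomicity`, the extra stores (p) and (q), and the data layout are hypothetical.

```python
def violates_atomicity(gmo, ops, lr, sc, read_from=None):
    """Return True if another hart's store to an overlapping byte sits between
    the store the LR read from and the paired SC in global memory order.

    gmo: list of operation labels in global memory order.
    ops: label -> dict(hart, kind, bytes), `bytes` being a set of byte offsets.
    read_from: label of the store whose value the LR returned, or None if the
               LR returned the initial contents of memory.
    """
    pos = {label: i for i, label in enumerate(gmo)}
    start = pos[read_from] if read_from is not None else -1
    for label in gmo[start + 1 : pos[sc]]:
        op = ops[label]
        if op["kind"] != "store":
            continue
        other_hart = op["hart"] != ops[lr]["hart"]
        overlaps = bool(op["bytes"] & ops[lr]["bytes"])
        if other_hart and overlaps:
            return True
    return False


OPS = {
    "a": dict(hart=0, kind="load", bytes=set(range(0, 8))),    # lr.d a0, 0(s0)
    "b": dict(hart=0, kind="store", bytes=set(range(4, 8))),   # sw t1, 4(s0)
    "c": dict(hart=0, kind="store", bytes=set(range(0, 8))),   # sc.d t2, 0(s0)
    "p": dict(hart=1, kind="store", bytes=set(range(8, 12))),  # other hart, no overlap
    "q": dict(hart=1, kind="store", bytes=set(range(4, 8))),   # other hart, overlapping
}

print(violates_atomicity(["a", "b", "p", "c"], OPS, "a", "c"))  # False
print(violates_atomicity(["a", "q", "b", "c"], OPS, "a", "c"))  # True
```

The intervening same-hart store (b) and the non-overlapping store (p) do not trip the check, which matches the false-sharing discussion above. The overlapping store (q) from another hart does.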


Progress Axiom

The progress axiom ensures a minimal forward progress guarantee. It
ensures that stores from one hart will eventually be made visible to
other harts in the system in a finite amount of time, and that loads
from other harts will eventually be able to read those values (or
successors thereof). Without this rule, it would be legal, for example,
for a spinlock to spin infinitely on a value, even with a store from
another hart waiting to unlock the spinlock.
The progress axiom is intended not to impose any other notion of
fairness, latency, or quality of service onto the harts in a RISC-V
implementation. Any stronger notions of fairness are up to the rest of
the ISA and/or up to the platform and/or device to define and implement.
The forward progress axiom will in almost all cases be naturally
satisfied by any standard cache coherence protocol. Implementations with
non-coherent caches may have to provide some other mechanism to ensure
the eventual visibility of all stores (or successors thereof) to all
harts.
Overlapping-Address Orderings (Rules [ppo:->st]–[ppo:amoforward])

Rule [ppo:->st]:

Rule [ppo:rdw]:

Rule [ppo:amoforward]:

Same-address orderings where the latter is a store are straightforward:
a load or store can never be reordered with a later store to an
overlapping memory location. From a microarchitecture perspective,
generally speaking, it is difficult or impossible to undo a
speculatively reordered store if the speculation turns out to be
invalid, so such behavior is simply disallowed by the model.
Same-address orderings from a store to a later load, on the other hand,
do not need to be enforced. As discussed in
Section 1.3.2, this reflects the
observable behavior of implementations that forward values from buffered
stores to later loads.
Same-address load-load ordering requirements are far more subtle. The
basic requirement is that a younger load must not return a value that is
older than a value returned by an older load in the same hart to the
same address. This is often known as “CoRR” (Coherence for Read-Read
pairs), or as part of a broader “coherence” or “sequential consistency
per location” requirement. Some architectures in the past have relaxed
same-address load-load ordering, but in hindsight this is generally
considered to complicate the programming model too much, and so RVWMO
requires CoRR ordering to be enforced. However, because the global
memory order corresponds to the order in which loads perform rather than
the ordering of the values being returned, capturing CoRR requirements
in terms of the global memory order requires a bit of indirection.

| | Hart 0 | | Hart 1 |
|---|---|---|---|
| | li t1, 1 | | li t2, 2 |
| (a) | sw t1,0(s0) | (d) | lw a0,0(s1) |
| (b) | fence w, w | (e) | sw t2,0(s1) |
| (c) | sw t1,0(s1) | (f) | lw a1,0(s1) |
| | | (g) | xor t3,a1,a1 |
| | | (h) | add s0,s0,t3 |
| | | (i) | lw a2,0(s0) |

Outcome: a0=1, a1=2, a2=0

Consider the litmus test of
Figure [fig:litmus:frirfi], which is one
particular instance of the more general “fri-rfi” pattern. The term
“fri-rfi” refers to the sequence (d), (e), (f): (d) “from-reads” (i.e.,
reads from an earlier write than) (e), which is in the same hart, and (f)
reads from (e), which is in the same hart.
From a microarchitectural perspective, outcome a0=1, a1=2, a2=0 is
legal (as are various other less subtle outcomes). Intuitively, the
following would produce the outcome in question:


(d) stalls (for whatever reason; perhaps it’s stalled waiting for
some other preceding instruction)


(e) executes and enters the store buffer (but does not yet drain
to memory)


(f) executes and forwards from (e) in the store buffer


(g), (h), and (i) execute


(a) executes and drains to memory, (b) executes, and (c) executes
and drains to memory


(d) unstalls and executes


(e) drains from the store buffer to memory


This corresponds to a global memory order of (f), (i), (a), (c), (d),
(e). Note that even though (f) performs before (d), the value returned
by (f) is newer than the value returned by (d). Therefore, this
execution is legal and does not violate the CoRR requirements.
Likewise, if two back-to-back loads return the values written by the
same store, then they may also appear out-of-order in the global memory
order without violating CoRR. Note that this is not the same as saying
that the two loads return the same value, since two different stores may
write the same value.
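The CoRR requirement is a statement about the age of the values returned, not about the order in which loads perform. A minimal sketch of such a check, assuming the coherence order of the stores to one address is known, is shown below. The function `violates_corr` and its argument layout are hypothetical.

```python
def violates_corr(read_from_in_program_order, coherence_order):
    """Return True if a younger load returns an older value than an older load.

    read_from_in_program_order: for one hart and one address, the label of the
        store each load read from, listed in the loads' program order.
    coherence_order: labels of the stores to that address, oldest first.
    """
    rank = {store: i for i, store in enumerate(coherence_order)}
    ranks = [rank[store] for store in read_from_in_program_order]
    # Checking adjacent pairs suffices to show the whole sequence never goes backwards.
    return any(later < earlier for earlier, later in zip(ranks, ranks[1:]))


# fri-rfi test, loads of y in Hart 1: (d) reads from (c), (f) reads from (e),
# and (c) precedes (e) in the coherence order.
print(violates_corr(["c", "e"], ["c", "e"]))  # False
# Two back-to-back loads reading from the same store are also fine.
print(violates_corr(["c", "c"], ["c", "e"]))  # False
# A younger load returning the older value would violate CoRR.
print(violates_corr(["e", "c"], ["c", "e"]))  # True
```

Note that the check never consults the global memory order of the loads themselves, which is why (f) may perform before (d) in the fri-rfi execution.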

| | Hart 0 | | Hart 1 |
|---|---|---|---|
| | li t1, 1 | (d) | lw a0,0(s1) |
| (a) | sw t1,0(s0) | (e) | xor t2,a0,a0 |
| (b) | fence w, w | (f) | add s4,s2,t2 |
| (c) | sw t1,0(s1) | (g) | lw a1,0(s4) |
| | | (h) | lw a2,0(s2) |
| | | (i) | xor t3,a2,a2 |
| | | (j) | add s0,s0,t3 |
| | | (k) | lw a3,0(s0) |

Outcome: a0=1, a1=v, a2=v, a3=0

Consider the litmus test of
Figure [fig:litmus:rsw]. The outcome
a0=1, a1=v, a2=v, a3=0 (where v is some value written by
another hart) can be observed by allowing (g) and (h) to be reordered.
This might be done speculatively, and the speculation can be justified
by the microarchitecture (e.g., by snooping for cache invalidations and
finding none) because replaying (h) after (g) would return the value
written by the same store anyway. Hence assuming a1 and a2 would end
up with the same value written by the same store anyway, (g) and (h) can
be legally reordered. The global memory order corresponding to this
execution would be (h),(k),(a),(c),(d),(g).
Executions of the test in
Figure [fig:litmus:rsw] in which a1 does
not equal a2 do in fact require that (g) appears before (h) in the
global memory order. Allowing (h) to appear before (g) in the global
memory order would in that case result in a violation of CoRR, because
then (h) would return an older value than that returned by (g).
Therefore, PPO rule [ppo:rdw] forbids this CoRR violation from
occurring. As such, PPO
rule [ppo:rdw] strikes a careful balance between
enforcing CoRR in all cases while simultaneously being weak enough to
permit “RSW” and “fri-rfi” patterns that commonly appear in real
microarchitectures.
There is one more overlapping-address rule: PPO
rule [ppo:amoforward] simply states that
a value cannot be returned from an AMO or SC to a subsequent load until
the AMO or SC has (in the case of the SC, successfully) performed
globally. This follows somewhat naturally from the conceptual view that
both AMOs and SC instructions are meant to be performed atomically in
memory. However, notably, PPO
rule [ppo:amoforward] states that
hardware may not even non-speculatively forward the value being stored
by an AMOSWAP to a subsequent load, even though for AMOSWAP that store
value is not actually semantically dependent on the previous value in
memory, as is the case for the other AMOs. The same holds true even when
forwarding from SC store values that are not semantically dependent on
the value returned by the paired LR.
The three PPO rules above also apply when the memory accesses in
question only overlap partially. This can occur, for example, when
accesses of different sizes are used to access the same object. Note
also that the base addresses of two overlapping memory operations need
not necessarily be the same for two memory accesses to overlap. When
misaligned memory accesses are being used, the overlapping-address PPO
rules apply to each of the component memory accesses independently.
Fences (Rule [ppo:fence])

Rule [ppo:fence]:

By default, the FENCE instruction ensures that all memory accesses from
instructions preceding the fence in program order (the “predecessor
set”) appear earlier in the global memory order than memory accesses
from instructions appearing after the fence in program order (the
“successor set”). However, fences can optionally further restrict the
predecessor set and/or the successor set to a smaller set of memory
accesses in order to provide some speedup. Specifically, fences have PR,
PW, SR, and SW bits which restrict the predecessor and/or successor
sets. The predecessor set includes loads (resp. stores) if and only if PR
(resp. PW) is set. Similarly, the successor set includes loads
(resp. stores) if and only if SR (resp. SW) is set.
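The effect of the four bits can be captured by a one-line predicate. In the sketch below, the predecessor and successor sets are written as strings over "r" and "w", mirroring the assembly syntax. The function name `fence_orders` is hypothetical, and FENCE.TSO is not modeled.

```python
def fence_orders(pred, succ, before, after):
    """Return True if FENCE pred,succ orders access `before` ahead of `after`.

    pred, succ: strings drawn from "r" and "w", e.g. "rw" or "r".
    before: kind of an access preceding the fence in program order ("r" or "w").
    after:  kind of an access following the fence in program order ("r" or "w").
    """
    return before in pred and after in succ


# FENCE R,RW orders an earlier load before a later store...
print(fence_orders("r", "rw", "r", "w"))  # True
# ...but says nothing about an earlier store.
print(fence_orders("r", "rw", "w", "r"))  # False
# FENCE W,W does not order an earlier load before a later store.
print(fence_orders("w", "w", "r", "w"))   # False
```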
The FENCE encoding currently has nine non-trivial combinations of the
four bits PR, PW, SR, and SW, plus one extra encoding FENCE.TSO which
facilitates mapping of “acquire+release” or RVTSO semantics. The
remaining seven combinations have empty predecessor and/or successor
sets and hence are no-ops. Of the ten non-trivial options, only six are
commonly used in practice:


FENCE RW,RW


FENCE.TSO


FENCE RW,W


FENCE R,RW


FENCE R,R


FENCE W,W


FENCE instructions using any other combination of PR, PW, SR, and SW are
reserved. We strongly recommend that programmers stick to these six.
Other combinations may have unknown or unexpected interactions with the
memory model.
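The counts quoted above can be reproduced by enumerating the sixteen settings of PR, PW, SR, and SW. The sketch below is only a counting aid; the names `SETS`, `COMMON`, and `classify_fence_combinations` are hypothetical, and FENCE.TSO is a separate encoding that is not part of the enumeration.

```python
from itertools import product

SETS = ["", "r", "w", "rw"]  # possible predecessor or successor sets

# The five PR/PW/SR/SW combinations from the list above (FENCE.TSO excluded).
COMMON = {("rw", "rw"), ("rw", "w"), ("r", "rw"), ("r", "r"), ("w", "w")}


def classify_fence_combinations():
    """Split the 16 (predecessor, successor) combinations into two groups."""
    non_trivial, no_ops = [], []
    for pred, succ in product(SETS, SETS):
        # A fence with an empty predecessor or successor set orders nothing.
        (non_trivial if pred and succ else no_ops).append((pred, succ))
    return non_trivial, no_ops


non_trivial, no_ops = classify_fence_combinations()
print(len(non_trivial), len(no_ops))  # 9 7

# Non-trivial combinations that are not among the commonly used ones.
for pred, succ in non_trivial:
    if (pred, succ) not in COMMON:
        print(f"FENCE {pred.upper()},{succ.upper()}")
```

The loop prints the four remaining combinations: FENCE R,W, FENCE W,R, FENCE W,RW, and FENCE RW,R.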
Finally, we note that since RISC-V uses a multi-copy atomic memory
model, programmers can reason about fence bits in a thread-local
manner. There is no complex notion of “fence cumulativity” as found in
memory models that are not multi-copy atomic.
Explicit Synchronization (Rules [ppo:acquire]–[ppo:pair])

Rule [ppo:acquire]:

Rule [ppo:release]:

Rule [ppo:rcsc]:

Rule [ppo:pair]:

An acquire operation, as would be used at the start of a critical
section, requires all memory operations following the acquire in program
order to also follow the acquire in the global memory order. This
ensures, for example, that all loads and stores inside the critical
section are up to date with respect to the synchronization variable
being used to protect it. Acquire ordering can be enforced in one of two
ways: with an acquire annotation, which enforces ordering with respect
to just the synchronization variable itself, or with a FENCE R,RW, which
enforces ordering with respect to all previous loads.
          sd           x1, (a1)     # Arbitrary unrelated store
          ld           x2, (a2)     # Arbitrary unrelated load
          li           t0, 1        # Initialize swap value.
      again:
          amoswap.w.aq t0, t0, (a0) # Attempt to acquire lock.
          bnez         t0, again    # Retry if held.
          # ...
          # Critical section.
          # ...
          amoswap.w.rl x0, x0, (a0) # Release lock by storing 0.
          sd           x3, (a3)     # Arbitrary unrelated store
          ld           x4, (a4)     # Arbitrary unrelated load

Consider
Figure [fig:litmus:spinlock_atomics].
Because this example uses aq, the loads and stores in the critical
section are guaranteed to appear in the global memory order after the
AMOSWAP used to acquire the lock. However, assuming a0, a1, and a2
point to different memory locations, the loads and stores in the
critical section may or may not appear after the “Arbitrary unrelated
load” at the beginning of the example in the global memory order.
          sd           x1, (a1)     # Arbitrary unrelated store
          ld           x2, (a2)     # Arbitrary unrelated load
          li           t0, 1        # Initialize swap value.
      again:
          amoswap.w    t0, t0, (a0) # Attempt to acquire lock.
          fence        r, rw        # Enforce "acquire" memory ordering
          bnez         t0, again    # Retry if held.
          # ...
          # Critical section.
          # ...
          fence        rw, w        # Enforce "release" memory ordering
          amoswap.w    x0, x0, (a0) # Release lock by storing 0.
          sd           x3, (a3)     # Arbitrary unrelated store
          ld           x4, (a4)     # Arbitrary unrelated load

Now, consider the alternative in
Figure [fig:litmus:spinlock_fences].
In this case, even though the AMOSWAP does not enforce ordering with an
aq bit, the fence nevertheless enforces that the acquire AMOSWAP
appears earlier in the global memory order than all loads and stores in
the critical section. Note, however, that in this case, the fence also
enforces additional orderings: it also requires that the “Arbitrary
unrelated load” at the start of the program appears earlier in the
global memory order than the loads and stores of the critical section.
(This particular fence does not, however, enforce any ordering with
respect to the “Arbitrary unrelated store” at the start of the snippet.)
In this way, fence-enforced orderings are slightly coarser than
orderings enforced by .aq.
Release orderings work exactly the same as acquire orderings, just in
the opposite direction. Release semantics require all loads and stores
preceding the release operation in program order to also precede the
release operation in the global memory order. This ensures, for example,
that memory accesses in a critical section appear before the
lock-releasing store in the global memory order. Just as for acquire
semantics, release semantics can be enforced using release annotations
or with a FENCE RW,W operation. Using the same examples, the ordering
between the loads and stores in the critical section and the “Arbitrary
unrelated store” at the end of the code snippet is enforced only by the
FENCE RW,W in
Figure [fig:litmus:spinlock_fences],
not by the rl in
Figure [fig:litmus:spinlock_atomics].
With RCpc annotations alone, store-release-to-load-acquire ordering is
not enforced. This facilitates the porting of code written under the TSO
and/or RCpc memory models. To enforce store-release-to-load-acquire
ordering, the code must use store-release-RCsc and load-acquire-RCsc
operations so that PPO rule
[ppo:rcsc] applies. RCpc alone is
sufficient for many use cases in C/C++ but is insufficient for many
other use cases in C/C++, Java, and Linux, to name just a few examples;
see Section 1.5 for details.
PPO rule [ppo:pair] indicates that an SC must
appear after its paired LR in the global memory order. This will follow
naturally from the common use of LR/SC to perform an atomic
read-modify-write operation due to the inherent data dependency.
However, PPO rule [ppo:pair] also applies even when the
value being stored does not syntactically depend on the value returned
by the paired LR.
Lastly, we note that just as with fences, programmers need not worry
about “cumulativity” when analyzing ordering annotations.
Syntactic Dependencies (Rules [ppo:addr]–[ppo:ctrl])

Rule [ppo:addr]:

Rule [ppo:data]:

Rule [ppo:ctrl]:

Dependencies from a load to a later memory operation in the same hart
are respected by the RVWMO memory model. The Alpha memory model was
notable for choosing not to enforce the ordering of such dependencies,
but most modern hardware and software memory models consider allowing
dependent instructions to be reordered too confusing and
counterintuitive. Furthermore, modern code sometimes intentionally uses
such dependencies as a particularly lightweight ordering enforcement
mechanism.
The terms in
Section [sec:memorymodel:dependencies]
work as follows. Instructions are said to carry dependencies from their
source register(s) to their destination register(s) whenever the value
written into each destination register is a function of the source
register(s). For most instructions, this means that the destination
register(s) carry a dependency from all source register(s). However,
there are a few notable exceptions. In the case of memory instructions,
the value written into the destination register ultimately comes from
the memory system rather than from the source register(s) directly, and
so this breaks the chain of dependencies carried from the source
register(s). In the case of unconditional jumps, the value written into
the destination register comes from the current pc (which is never
considered a source register by the memory model), and so likewise, JALR
(the only jump with a source register) does not carry a dependency from
rs1 to rd.

          (a) fadd  f3,f1,f2
          (b) fadd  f6,f4,f5
          (c) csrrs a0,fflags,x0

The notion of accumulating into a destination register rather than
writing into it reflects the behavior of CSRs such as fflags. In
particular, an accumulation into a register does not clobber any
previous writes or accumulations into the same register. For example, in
Figure [fig:litmus:fflags], (c) has a
syntactic dependency on both (a) and (b).
Like other modern memory models, the RVWMO memory model uses syntactic
rather than semantic dependencies. In other words, this definition
depends on the identities of the registers being accessed by different
instructions, not the actual contents of those registers. This means
that an address, control, or data dependency must be enforced even if
the calculation could seemingly be “optimized away”. This choice ensures
that RVWMO remains compatible with code that uses these false syntactic
dependencies as a lightweight ordering mechanism.

          ld  a1,0(s0)
          xor a2,a1,a1
          add s1,s1,a2
          ld  a5,0(s1)

For example, there is a syntactic address dependency from the memory
operation generated by the first instruction to the memory operation
generated by the last instruction in
Figure [fig:litmus:address], even
though a1 XOR a1 is zero and hence has no effect on the address
accessed by the second load.
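The register-identity view of dependencies can be made concrete with a small tracker. This is an illustrative sketch only: the tuple format, the labels `a` to `d`, and the function name are assumptions introduced here, and only loads and simple ALU instructions are handled. Each register records which loads it syntactically carries a value from; a load breaks the chain from its address register but records an address dependency on whatever that register carried.

```python
def syntactic_address_deps(program):
    """Return (earlier load, later load) pairs linked by a syntactic address dependency.

    program: list of (label, op, rd, source_registers); for "ld" the only
    source register is the address base register.
    """
    carries = {}  # register -> set of load labels whose result it carries
    deps = []
    for label, op, rd, srcs in program:
        if op == "ld":
            (base,) = srcs
            # The address register carries values from these earlier loads.
            deps.extend((src, label) for src in sorted(carries.get(base, set())))
            # The loaded value comes from memory, so the chain restarts here.
            carries[rd] = {label}
        elif rd != "x0":
            merged = set()
            for reg in srcs:
                merged |= carries.get(reg, set())
            # Register identities matter, not values: xor a2,a1,a1 still carries a1.
            carries[rd] = merged
    return deps

program = [
    ("a", "ld",  "a1", ["s0"]),
    ("b", "xor", "a2", ["a1", "a1"]),
    ("c", "add", "s1", ["s1", "a2"]),
    ("d", "ld",  "a5", ["s1"]),
]
print(syntactic_address_deps(program))  # [('a', 'd')]
```

The tracker never evaluates the XOR, which is the point: the dependency exists even though the computed offset is always zero.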
The benefit of using dependencies as a lightweight synchronization
mechanism is that the ordering enforcement requirement is limited only
to the specific two instructions in question. Other non-dependent
instructions may be freely reordered by aggressive implementations. One
alternative would be to use a load-acquire, but this would enforce
ordering for the first load with respect to all subsequent
instructions. Another would be to use a FENCE R,R, but this would
include all previous and all subsequent loads, making this option more
expensive.

          lw  x1,0(x2)
          bne x1,x0,next
          sw  x3,0(x4)
      next:
          sw  x5,0(x6)

Control dependencies behave differently from address and data
dependencies in the sense that a control dependency always extends to
all instructions following the original target in program order.
Consider Figure [fig:litmus:control1]: the
instruction at next will always execute, but the memory operation
generated by that last instruction nevertheless still has a control
dependency from the memory operation generated by the first instruction.

          lw  x1,0(x2)
          bne x1,x0,next
      next:
          sw  x3,0(x4)

Likewise, consider
Figure [fig:litmus:control2]. Even
though both branch outcomes have the same target, there is still a
control dependency from the memory operation generated by the first
instruction in this snippet to the memory operation generated by the
last instruction. This definition of control dependency is subtly
stronger than what might be seen in other contexts (e.g., C++), but it
conforms with standard definitions of control dependencies in the
literature.
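The "extends to all following instructions" property can be sketched as follows. This is a simplified illustration under stated assumptions: the input is one dynamic instruction trace in program order (here the not-taken path), the labels and tuple format are hypothetical, and only `lw`, `sw`, `bne`, and plain ALU instructions are modelled.

```python
def control_deps(trace):
    """Return (load, memory op) pairs linked by a syntactic control dependency.

    trace: list of (label, op, rd, source_registers) in program order.
    """
    carries = {}            # register -> loads whose value it syntactically carries
    guarding_loads = set()  # loads feeding any branch seen so far
    deps = []
    for label, op, rd, srcs in trace:
        if op in ("lw", "sw"):
            # Every memory op after a branch depends on the loads feeding that
            # branch, regardless of where the branch target is.
            deps.extend((load, label) for load in sorted(guarding_loads))
        if op == "lw":
            carries[rd] = {label}
        elif op == "bne":
            for reg in srcs:
                guarding_loads |= carries.get(reg, set())
        elif op != "sw" and rd != "x0":
            carries[rd] = set().union(*(carries.get(reg, set()) for reg in srcs))
    return deps

# First control example: the store at "next" still depends on the load.
control1 = [
    ("a", "lw",  "x1", ["x2"]),
    ("b", "bne", None, ["x1", "x0"]),
    ("c", "sw",  None, ["x3", "x4"]),
    ("d", "sw",  None, ["x5", "x6"]),
]
# Second control example: both branch outcomes reach the same store.
control2 = [
    ("a", "lw",  "x1", ["x2"]),
    ("b", "bne", None, ["x1", "x0"]),
    ("c", "sw",  None, ["x3", "x4"]),
]
print(control_deps(control1))  # [('a', 'c'), ('a', 'd')]
print(control_deps(control2))  # [('a', 'c')]
```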
Notably, PPO rules [ppo:addr]–[ppo:ctrl] are also intentionally designed
to respect dependencies that originate from the output of a successful
store-conditional instruction. Typically, an SC instruction will be
followed by a conditional branch checking whether the outcome was
successful; this implies that there will be a control dependency from
the store operation generated by the SC instruction to any memory
operations following the branch. PPO
rule [ppo:ctrl] in turn implies that any
subsequent store operations will appear later in the global memory order
than the store operation generated by the SC. However, since control,
address, and data dependencies are defined over memory operations, and
since an unsuccessful SC does not generate a memory operation, no order
is enforced between an unsuccessful SC and its dependent instructions.
Moreover, since SC is defined to carry dependencies from its source
registers to rd only when the SC is successful, an unsuccessful SC has
no effect on the global memory order.

Initial values: 0(s0)=1; 0(s2)=1

| Hart 0 | Hart 1 |
|---|---|
| (a) ld a0,0(s0) | (e) ld a3,0(s2) |
| (b) lr a1,0(s1) | (f) sd a3,0(s0) |
| (c) sc a2,a0,0(s1) | |
| (d) sd a2,0(s2) | |

Outcome: a0=0, a3=0

In addition, the choice to respect dependencies originating at
store-conditional instructions ensures that certain out-of-thin-air-like
behaviors will be prevented. Consider
Figure [fig:litmus:successdeps].
Suppose a hypothetical implementation could occasionally make some early
guarantee that a store-conditional operation will succeed. In this case,
(c) could return 0 to a2 early (before actually executing), allowing
the sequence (d), (e), (f), (a), and then (b) to execute, and then (c)
might execute (successfully) only at that point. This would imply that
(c) writes its own success value to 0(s1)! Fortunately, this situation
and others like it are prevented by the fact that RVWMO respects
dependencies originating at the stores generated by successful SC
instructions.
We also note that syntactic dependencies between instructions only have
any force when they take the form of a syntactic address, control,
and/or data dependency. For example: a syntactic dependency between two
“F” instructions via one of the “accumulating CSRs” in
Section [sec:source-dest-regs] does
not imply that the two “F” instructions must be executed in order.
Such a dependency serves only to set up a later dependency from both
“F” instructions to a subsequent CSR instruction that accesses the CSR
flag in question.
Pipeline Dependencies (Rules [ppo:addrdatarfi]–[ppo:addrpo])

Rule [ppo:addrdatarfi]:

Rule [ppo:addrpo]:


| Hart 0 | Hart 1 |
|---|---|
| li t1, 1 | (d) lw a0, 0(s1) |
| (a) sw t1,0(s0) | (e) sw a0, 0(s2) |
| (b) fence w, w | (f) lw a1, 0(s2) |
| (c) sw t1,0(s1) | xor a2,a1,a1 |
| | add s0,s0,a2 |
| | (g) lw a3,0(s0) |

Outcome: a0=1, a3=0

PPO rules [ppo:addrdatarfi] and
[ppo:addrpo] reflect behaviors of almost
all real processor pipeline implementations.
Rule [ppo:addrdatarfi] states that a
load cannot forward from a store until the address and data for that
store are known. Consider
Figure [fig:litmus:addrdatarfi]:
(f) cannot be executed until the data for (e) has been resolved, because
(f) must return the value written by (e) (or by something even later in
the global memory order), and the old value must not be clobbered by the
writeback of (e) before (d) has had a chance to perform. Therefore, (f)
will never perform before (d) has performed.

| Hart 0 | Hart 1 |
|---|---|
| li t1, 1 | li t1, 1 |
| (a) sw t1,0(s0) | (d) lw a0, 0(s1) |
| (b) fence w, w | (e) sw a0, 0(s2) |
| (c) sw t1,0(s1) | (f) sw t1, 0(s2) |
| | (g) lw a1, 0(s2) |
| | xor a2,a1,a1 |
| | add s0,s0,a2 |
| | (h) lw a3,0(s0) |

Outcome: a0=1, a3=0

If there were another store to the same address in between (e) and (f),
as in
Figure [fig:litmus:addrdatarfi_no]
(where the extra store is labeled (f) and the load becomes (g)),
then the load would no longer be dependent on the data of (e) being
resolved, and hence the dependency of the load on (d), which produces
the data for (e), would be broken.
Rule [ppo:addrpo] makes a similar observation
to the previous rule: a store cannot be performed at memory until all
previous loads that might access the same address have themselves been
performed. Such a load must appear to execute before the store, but it
cannot do so if the store were to overwrite the value in memory before
the load had a chance to read the old value. Likewise, a store generally
cannot be performed until it is known that preceding instructions will
not cause an exception due to failed address resolution, and in this
sense, rule [ppo:addrpo] can be seen as somewhat of
a special case of rule [ppo:ctrl].

| Hart 0 | Hart 1 |
|---|---|
| | li t1, 1 |
| (a) lw a0,0(s0) | (d) lw a1, 0(s1) |
| (b) fence rw,rw | (e) lw a2, 0(a1) |
| (c) sw s2,0(s1) | (f) sw t1, 0(s0) |

Outcome: a0=1, a1=t

Consider Figure [fig:litmus:addrpo]: (f) cannot
be executed until the address for (e) is resolved, because it may turn
out that the addresses match; i.e., that a1=s0. Therefore, (f) cannot
be sent to memory before (d) has executed and confirmed whether the
addresses do indeed overlap.
Beyond Main Memory

RVWMO does not currently attempt to formally describe how FENCE.I,
SFENCE.VMA, I/O fences, and PMAs behave. All of these behaviors will be
described by future formalizations. In the meantime, the behavior of
FENCE.I is described in
Chapter [chap:zifencei], the behavior of
SFENCE.VMA is described in the RISC-V Instruction Set Privileged
Architecture Manual, and the behavior of I/O fences and the effects of
PMAs are described below.
Coherence and Cacheability

The RISC-V Privileged ISA defines Physical Memory Attributes (PMAs)
which specify, among other things, whether portions of the address space
are coherent and/or cacheable. See the RISC-V Privileged ISA
Specification for the complete details. Here, we simply discuss how the
various details in each PMA relate to the memory model:


Main memory vs. I/O, and I/O memory ordering PMAs: the memory model
as defined applies to main memory regions. I/O ordering is discussed
below.


Supported access types and atomicity PMAs: the memory model is
simply applied on top of whatever primitives each region supports.


Cacheability PMAs: the cacheability PMAs in general do not affect
the memory model. Non-cacheable regions may have more restrictive
behavior than cacheable regions, but the set of allowed behaviors
does not change regardless. However, some platform-specific and/or
device-specific cacheability settings may differ.


Coherence PMAs: The memory consistency model for memory regions
marked as non-coherent in PMAs is currently platform-specific and/or
device-specific: the load-value axiom, the atomicity axiom, and the
progress axiom all may be violated with non-coherent memory. Note
however that coherent memory does not require a hardware cache
coherence protocol. The RISC-V Privileged ISA Specification suggests
that hardware-incoherent regions of main memory are discouraged, but
the memory model is compatible with hardware coherence, software
coherence, implicit coherence due to read-only memory, implicit
coherence due to only one agent having access, or otherwise.


Idempotency PMAs: Idempotency PMAs are used to specify memory
regions for which loads and/or stores may have side effects, and
this in turn is used by the microarchitecture to determine, e.g.,
whether prefetches are legal. This distinction does not affect the
memory model.


I/O Ordering

For I/O, the load value axiom and atomicity axiom in general do not
apply, as both reads and writes might have device-specific side effects
and may return values other than the value “written” by the most recent
store to the same address. Nevertheless, the following preserved program
order rules still generally apply for accesses to I/O memory: memory
access a precedes memory access b in global memory order if a
precedes b in program order and one or more of the following holds:


a precedes b in preserved program order as defined in
Chapter [ch:memorymodel], with the
exception that acquire and release ordering annotations apply only
from one memory operation to another memory operation and from one
I/O operation to another I/O operation, but not from a memory
operation to an I/O nor vice versa


a and b are accesses to overlapping addresses in an I/O region


a and b are accesses to the same strongly ordered I/O region


a and b are accesses to I/O regions, and the channel associated
with the I/O region accessed by either a or b is channel 1


a and b are accesses to I/O regions associated with the same
channel (except for channel 0)
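The conditions listed above can be read as a single predicate over two I/O accesses. The sketch below is only a reading aid under several assumptions: the `IOAccess` record, its field names, and the idea of attaching a channel number and a strongly-ordered flag directly to each access are modelling choices made here, and the first condition (ordinary preserved program order) is passed in as a flag rather than computed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IOAccess:
    region: str             # hypothetical name of the I/O region being accessed
    strongly_ordered: bool  # whether that region is strongly ordered
    channel: int            # channel associated with that region
    start: int              # first byte address accessed
    size: int               # number of bytes accessed

def overlaps(a, b):
    """True if the two byte ranges share at least one address."""
    return a.start < b.start + b.size and b.start < a.start + a.size

def io_ordered(a, b, in_ppo=False):
    """True if a (earlier in program order) must precede b in global memory order."""
    if in_ppo:                                       # ordered by preserved program order
        return True
    if overlaps(a, b):                               # overlapping I/O addresses
        return True
    if a.region == b.region and a.strongly_ordered:  # same strongly ordered region
        return True
    if a.channel == 1 or b.channel == 1:             # either access uses channel 1
        return True
    if a.channel == b.channel and a.channel != 0:    # same channel, except channel 0
        return True
    return False

uart_data = IOAccess("uart", False, 0, 0x1000, 4)
uart_status = IOAccess("uart", False, 0, 0x1004, 4)
nic_doorbell = IOAccess("nic", False, 1, 0x2000, 4)
print(io_ordered(uart_data, uart_status))   # False
print(io_ordered(uart_data, nic_doorbell))  # True
```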


Note that the FENCE instruction distinguishes between main memory
operations and I/O operations in its predecessor and successor sets. To
enforce ordering between I/O operations and main memory operations, code
must use a FENCE with PI, PO, SI, and/or SO, plus PR, PW, SR, and/or SW.
For example, to enforce ordering between a write to main memory and an
I/O write to a device register, a FENCE W,O or stronger is needed.

          sd    t0, 0(a0)
          fence w,o
          sd    a0, 0(a1)

When a fence is in fact used, implementations must assume that the
device may attempt to access memory immediately after receiving the MMIO
signal, and subsequent memory accesses from that device to memory must
observe the effects of all accesses ordered prior to that MMIO
operation. In other words, in
Figure [fig:litmus:wo], suppose 0(a0) is
in main memory and 0(a1) is the address of a device register in I/O
memory. If the device accesses 0(a0) upon receiving the MMIO write,
then that load must conceptually appear after the first store to 0(a0)
according to the rules of the RVWMO memory model. In some
implementations, the only way to ensure this will be to require that the
first store does in fact complete before the MMIO write is issued. Other
implementations may find ways to be more aggressive, while others still
may not need to do anything different at all for I/O and main memory
accesses. Nevertheless, the RVWMO memory model does not distinguish
between these options; it simply provides an implementation-agnostic
mechanism to specify the orderings that must be enforced.
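As a concrete illustration of the PI, PO, PR, PW and SI, SO, SR, SW bits, the following sketch assembles the 32-bit FENCE instruction word. The field layout used here (fm in bits 31:28, predecessor bits in 27:24, successor bits in 23:20, MISC-MEM opcode 0001111, with rs1, funct3, and rd all zero) comes from the base ISA's FENCE definition rather than from this section, and the helper names are illustrative assumptions.

```python
# Bit weights within each 4-bit predecessor/successor field: I, O, R, W.
ACCESS_BITS = {"i": 0b1000, "o": 0b0100, "r": 0b0010, "w": 0b0001}
MISC_MEM_OPCODE = 0b0001111
FM_TSO = 0b1000  # fm value used by FENCE.TSO

def access_mask(spec):
    """Convert a string such as "rw" or "iorw" into a 4-bit mask."""
    mask = 0
    for letter in spec.lower():
        mask |= ACCESS_BITS[letter]
    return mask

def encode_fence(pred, succ, fm=0):
    """Return the instruction word for `fence pred,succ` (rs1, funct3, rd are zero)."""
    return (fm << 28) | (access_mask(pred) << 24) | (access_mask(succ) << 20) | MISC_MEM_OPCODE

print(hex(encode_fence("w", "o")))            # 0x140000f
print(hex(encode_fence("iorw", "iorw")))      # 0xff0000f
print(hex(encode_fence("rw", "rw", FM_TSO)))  # 0x8330000f
```

The first line corresponds to the `fence w,o` used in the example above: only PW is set in the predecessor field and only SO in the successor field.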
Many architectures include separate notions of “ordering” and
“completion” fences, especially as it relates to I/O (as opposed to
regular main memory). Ordering fences simply ensure that memory
operations stay in order, while completion fences ensure that
predecessor accesses have all completed before any successors are made
visible. RISC-V does not explicitly distinguish between ordering and
completion fences. Instead, this distinction is simply inferred from
different uses of the FENCE bits.
For implementations that conform to the RISC-V Unix Platform
Specification, I/O devices and DMA operations are required to access
memory coherently and via strongly ordered I/O channels. Therefore,
accesses to regular main memory regions that are concurrently accessed
by external devices can also use the standard synchronization
mechanisms. Implementations that do not conform to the Unix Platform
Specification and/or in which devices do not access memory coherently
will need to use mechanisms (which are currently platform-specific or
device-specific) to enforce coherency.
I/O regions in the address space should be considered non-cacheable
regions in the PMAs for those regions. Such regions can be considered
coherent by the PMA if they are not cached by any agent.
The ordering guarantees in this section may not apply beyond a
platform-specific boundary between the RISC-V cores and the device. In
particular, I/O accesses sent across an external bus (e.g., PCIe) may be
reordered before they reach their ultimate destination. Ordering must be
enforced in such situations according to the platform-specific rules of
those external devices and buses.
Code Porting and Mapping Guidelines


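| x86/TSO Operation | RVWMO Mapping |
|---|---|
| Load | `l{b\|h\|w\|d}; fence r,rw` |
| Store | `fence rw,w; s{b\|h\|w\|d}` |
| Atomic RMW | `amo<op>.{w\|d}.aqrl` OR |
| | `loop: lr.{w\|d}.aq; <op>; sc.{w\|d}.aqrl; bnez loop` |
| Fence | `fence rw,rw` |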


Mappings from TSO operations to RISC-V operations

Table 1.2 provides a mapping from TSO
memory operations onto RISC-V memory instructions. Normal x86 loads and
stores are all inherently acquire-RCpc and release-RCpc operations: TSO
enforces all load-load, load-store, and store-store ordering by default.
Therefore, under RVWMO, all TSO loads must be mapped onto a load
followed by FENCE R,RW, and all TSO stores must be mapped onto
FENCE RW,W followed by a store. TSO atomic read-modify-writes and x86
instructions using the LOCK prefix are fully ordered and can be
implemented either via an AMO with both aq and rl set, or via an LR
with aq set, the arithmetic operation in question, an SC with both
aq and rl set, and a conditional branch checking the success
condition. In the latter case, the rl annotation on the LR turns out
(for non-obvious reasons) to be redundant and can be omitted.
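The mapping described in this paragraph is mechanical enough to express as a lookup. The sketch below is illustrative only: the operation names (`"load"`, `"store"`, `"rmw"`, `"fence"`), the function name, and the omission of register operands are assumptions made here, and only the AMO form of the read-modify-write mapping is generated.

```python
LOAD_STORE_WIDTHS = ("b", "h", "w", "d")
AMO_WIDTHS = ("w", "d")

def map_tso_op(op, width="w", amo_op="add"):
    """Return the RVWMO mnemonic sequence for one TSO operation (operands omitted)."""
    if op == "load":
        if width not in LOAD_STORE_WIDTHS:
            raise ValueError("unsupported load width: " + width)
        return ["l" + width, "fence r,rw"]       # load, then acquire-style fence
    if op == "store":
        if width not in LOAD_STORE_WIDTHS:
            raise ValueError("unsupported store width: " + width)
        return ["fence rw,w", "s" + width]       # release-style fence, then store
    if op == "rmw":
        if width not in AMO_WIDTHS:
            raise ValueError("AMOs exist only for w and d")
        return ["amo{}.{}.aqrl".format(amo_op, width)]  # fully ordered AMO
    if op == "fence":
        return ["fence rw,rw"]
    raise ValueError("unknown TSO operation: " + op)

for tso_op in ("load", "store", "rmw", "fence"):
    print(tso_op, "->", "; ".join(map_tso_op(tso_op, "d")))
```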
Alternatives to
Table 1.2 are also possible. A TSO store
can be mapped onto AMOSWAP with rl set. However, since RVWMO PPO
Rule [ppo:amoforward] forbids forwarding
of values from AMOs to subsequent loads, the use of AMOSWAP for stores
may negatively affect performance. A TSO load can be mapped using LR
with aq set: all such LR instructions will be unpaired, but that fact
in and of itself does not preclude the use of LR for loads. However,
again, this mapping may also negatively affect performance if it puts
more pressure on the reservation mechanism than was originally intended.


Power Operation
RVWMO Mapping


Load
`l{b


Load-Reserve
`lr.{w


Store
`s{b


Store-Conditional
`sc.{w


lwsync 
fence.tso 


sync 
fence rw,rw 


isync 
fence.i; fence r,r 


Mappings from Power operations to RISC-V operations

Table 1.3 provides a mapping from Power
memory operations onto RISC-V memory instructions. Power ISYNC maps on
RISC-V to a FENCE.I followed by a FENCE R,R; the latter fence is needed
because ISYNC is used to define a “control+control fence” dependency
that is not present in RVWMO.


ARM Operation
RVWMO Mapping


Load
`l{b


Load-Acquire
`fence rw, rw; l{b


Load-Exclusive
`lr.{w


Load-Acquire-Exclusive
`lr.{w


Store
`s{b


Store-Release
`fence rw,w; s{b


Store-Exclusive
`sc.{w


Store-Release-Exclusive
`sc.{w


dmb 
fence rw,rw 


dmb.ld 
fence r,rw 


dmb.st 
fence w,w 


isb 
fence.i; fence r,r 


Mappings from ARM operations to RISC-V operations

Table 1.4 provides a mapping from ARM
memory operations onto RISC-V memory instructions. Since RISC-V does not
currently have plain load and store opcodes with aq or rl
annotations, ARM load-acquire and store-release operations should be
mapped using fences instead. Furthermore, in order to enforce
store-release-to-load-acquire ordering, there must be a FENCE RW,RW
between the store-release and load-acquire;
Table 1.4 enforces this by always placing
the fence in front of each acquire operation. ARM load-exclusive and
store-exclusive instructions can likewise map onto their RISC-V LR and
SC equivalents, but instead of placing a FENCE RW,RW in front of an LR
with aq set, we simply also set rl instead. ARM ISB maps on RISC-V
to FENCE.I followed by FENCE R,R similarly to how ISYNC maps for Power.


Linux Operation
RVWMO Mapping


smp_mb() 
fence rw,rw 


smp_rmb() 
fence r,r 


smp_wmb() 
fence w,w 


dma_rmb() 
fence r,r 


dma_wmb() 
fence w,w 


mb() 
fence iorw,iorw 


rmb() 
fence ri,ri 


wmb() 
fence wo,wo 


smp_load_acquire() 
`l{b


smp_store_release() 
`fence.tso; s{b


Linux Construct
RVWMO AMO Mapping


atomic_<op>_relaxed 
`amo.{w


atomic_<op>_acquire 
`amo.{w


atomic_<op>_release 
`amo.{w


atomic_<op> 
`amo.{w


Linux Construct
RVWMO LR/SC Mapping


atomic_<op>_relaxed 
`loop:lr.{w


atomic_<op>_acquire 
`loop:lr.{w


atomic_<op>_release
`loop:lr.{w


`fence.tso; loop:lr.{w


atomic_<op> 
`loop:lr.{w


Mappings from Linux memory primitives to RISC-V primitives. Other
constructs (such as spinlocks) should follow accordingly. Platforms or
devices with non-coherent DMA may need additional synchronization (such
as cache flush or invalidate mechanisms); currently any such extra
synchronization will be device-specific.

Table 1.5 provides a mapping of Linux
memory ordering macros onto RISC-V memory instructions. The Linux fences
dma_rmb() and dma_wmb() map onto FENCE R,R and FENCE W,W,
respectively, since the RISC-V Unix Platform requires coherent DMA, but
would be mapped onto FENCE RI,RI and FENCE WO,WO, respectively, on a
platform with non-coherent DMA. Platforms with non-coherent DMA may also
require a mechanism by which cache lines can be flushed and/or
invalidated. Such mechanisms will be device-specific and/or standardized
in a future extension to the ISA.
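The fence rows of the Linux table, together with the coherent versus non-coherent DMA distinction just described, can be summarised in a small lookup. This is a sketch, not kernel code: the function name and the `coherent_dma` parameter are hypothetical, and the load-acquire, store-release, and atomic rows are left out.

```python
LINUX_FENCES = {
    "smp_mb":  "fence rw,rw",
    "smp_rmb": "fence r,r",
    "smp_wmb": "fence w,w",
    "mb":      "fence iorw,iorw",
    "rmb":     "fence ri,ri",
    "wmb":     "fence wo,wo",
}

def map_linux_barrier(name, coherent_dma=True):
    """Return the RISC-V fence for a Linux barrier macro (name given without parentheses)."""
    if name == "dma_rmb":
        # Coherent DMA needs ordering among memory reads only.
        return "fence r,r" if coherent_dma else "fence ri,ri"
    if name == "dma_wmb":
        return "fence w,w" if coherent_dma else "fence wo,wo"
    return LINUX_FENCES[name]

print(map_linux_barrier("dma_wmb"))                      # fence w,w
print(map_linux_barrier("dma_wmb", coherent_dma=False))  # fence wo,wo
```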
The Linux mappings for release operations may seem stronger than
necessary, but these mappings are needed to cover some cases in which
Linux requires stronger orderings than the more intuitive mappings would
provide. In particular, as of the time this text is being written, Linux
is actively debating whether to require load-load, load-store, and
store-store orderings between accesses in one critical section and
accesses in a subsequent critical section in the same hart and protected
by the same synchronization object. Not all combinations of
FENCE RW,W/FENCE R,RW mappings with aq/rl mappings combine to
provide such orderings. There are a few ways around this problem,
including:


Always use FENCE RW,W/FENCE R,RW, and never use aq/rl. This
suffices but is undesirable, as it defeats the purpose of the
aq/rl modifiers.


Always use aq/rl, and never use FENCE RW,W/FENCE R,RW. This does
not currently work due to the lack of load and store opcodes with
aq and rl modifiers.


Strengthen the mappings of release operations such that they would
enforce sufficient orderings in the presence of either type of
acquire mapping. This is the currently recommended solution, and the
one shown in
Table 1.5.


Linux code:

          (a)  int r0 = *x;
          (bc) spin_unlock(y, 0);
               ...
               ...
          (d)  spin_lock(y);
          (e)  int r1 = *z;

RVWMO Mapping:

          (a) lw           a0, 0(s0)
          (b) fence.tso                  // vs. fence rw,w
          (c) sd           x0,0(s1)
              ...
      loop:
          (d) amoswap.d.aq a1,t1,0(s1)
              bnez         a1,loop
          (e) lw           a2,0(s2)

For example, the critical section ordering rule currently being debated
by the Linux community would require (a) to be ordered before (e) in
Figure [fig:litmus:lkmm_ll]. If that
will indeed be required, then it would be insufficient for (b) to map as
FENCE RW,W. That said, these mappings are subject to change as the Linux
Kernel Memory Model evolves.


C/C++ Construct
RVWMO Mapping


Non-atomic load
`l{b


atomic_load(memory_order_relaxed) 
`l{b


atomic_load(memory_order_acquire) 
`l{b


atomic_load(memory_order_seq_cst) 
`fence rw,rw; l{b


Non-atomic store
`s{b


atomic_store(memory_order_relaxed) 
`s{b


atomic_store(memory_order_release) 
`fence rw,w; s{b


atomic_store(memory_order_seq_cst) 
`fence rw,w; s{b


atomic_thread_fence(memory_order_acquire) 
fence r,rw 


atomic_thread_fence(memory_order_release) 
fence rw,w 


atomic_thread_fence(memory_order_acq_rel) 
fence.tso


atomic_thread_fence(memory_order_seq_cst) 
fence rw,rw 


C/C++ Construct
RVWMO AMO Mapping


atomic_<op>(memory_order_relaxed) 
`amo.{w


atomic_<op>(memory_order_acquire) 
`amo.{w


atomic_<op>(memory_order_release) 
`amo.{w


atomic_<op>(memory_order_acq_rel) 
`amo.{w


atomic_<op>(memory_order_seq_cst) 
`amo.{w


C/C++ Construct
RVWMO LR/SC Mapping


atomic_<op>(memory_order_relaxed)
`loop:lr.{w


bnez loop 


atomic_<op>(memory_order_acquire)
`loop:lr.{w


bnez loop 


atomic_<op>(memory_order_release)
`loop:lr.{w


bnez loop 


atomic_<op>(memory_order_acq_rel)
`loop:lr.{w


bnez loop 


atomic_<op>(memory_order_seq_cst)
`loop:lr.{w


`sc.{w


Mappings from C/C++ primitives to RISC-V primitives.


C/C++ Construct
RVWMO Mapping


Non-atomic load
`l{b


atomic_load(memory_order_relaxed) 
`l{b


atomic_load(memory_order_acquire) 
`l{b


atomic_load(memory_order_seq_cst) 
`l{b


Non-atomic store
`s{b


atomic_store(memory_order_relaxed) 
`s{b


atomic_store(memory_order_release) 
`s{b


atomic_store(memory_order_seq_cst) 
`s{b


atomic_thread_fence(memory_order_acquire) 
fence r,rw 


atomic_thread_fence(memory_order_release) 
fence rw,w 


atomic_thread_fence(memory_order_acq_rel) 
fence.tso


atomic_thread_fence(memory_order_seq_cst) 
fence rw,rw 


C/C++ Construct
RVWMO AMO Mapping


atomic_<op>(memory_order_relaxed) 
`amo.{w


atomic_<op>(memory_order_acquire) 
`amo.{w


atomic_<op>(memory_order_release) 
`amo.{w


atomic_<op>(memory_order_acq_rel) 
`amo.{w


atomic_<op>(memory_order_seq_cst) 
`amo.{w


C/C++ Construct
RVWMO LR/SC Mapping


atomic_<op>(memory_order_relaxed) 
`lr.{w


atomic_<op>(memory_order_acquire) 
`lr.{w


atomic_<op>(memory_order_release) 
`lr.{w


atomic_<op>(memory_order_acq_rel) 
`lr.{w


atomic_<op>(memory_order_seq_cst) 
`lr.{w


*must be `lr.{w|d}.aqrl` in order to interoperate with code mapped per Table 1.6


Hypothetical mappings from C/C++ primitives to RISC-V primitives, if
native load-acquire and store-release opcodes are introduced.

Table 1.6 provides a mapping of C11/C++11
atomic operations onto RISC-V memory instructions. If load and store
opcodes with aq and rl modifiers are introduced, then the mappings
in
Table 1.7 will suffice. Note
however that the two mappings only interoperate correctly if
atomic_<op>(memory_order_seq_cst) is mapped using an LR that has both
aq and rl set.
Any AMO can be emulated by an LR/SC pair, but care must be taken to
ensure that any PPO orderings that originate from the LR are also made
to originate from the SC, and that any PPO orderings that terminate at
the SC are also made to terminate at the LR. For example, the LR must
also be made to respect any data dependencies that the AMO has, given
that load operations do not otherwise have any notion of a data
dependency. Likewise, the effect of a FENCE R,R elsewhere in the same hart
must also be made to apply to the SC, which would not otherwise respect
that fence. The emulator may achieve this effect by simply mapping AMOs
onto lr.aq; <op>; sc.aqrl, matching the mapping used elsewhere for
fully ordered atomics.
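The control structure of such an emulation (load-reserve, compute, store-conditional, retry on failure) can be sketched with a toy reservation model. This is an assumption-laden illustration: `ToyMemory`, its single global reservation, and the `interfere` hook are invented for this sketch, and the model says nothing about the aq/rl ordering that the paragraph above is actually concerned with; it shows only why the loop retries.

```python
class ToyMemory:
    """Toy memory with one global reservation; not a model of real hardware."""

    def __init__(self):
        self.words = {}
        self.reservation = None  # (hart, addr) or None

    def lr(self, hart, addr):
        self.reservation = (hart, addr)
        return self.words.get(addr, 0)

    def store(self, hart, addr, value):
        self.words[addr] = value
        # In this toy model any ordinary store to the reserved address cancels it.
        if self.reservation is not None and self.reservation[1] == addr:
            self.reservation = None

    def sc(self, hart, addr, value):
        ok = self.reservation == (hart, addr)
        self.reservation = None
        if ok:
            self.words[addr] = value
        return 0 if ok else 1  # as with SC, zero means success

def emulated_amoadd(mem, hart, addr, operand, interfere=None):
    """Emulate an AMO add; returns (old value, number of attempts)."""
    attempts = 0
    while True:
        attempts += 1
        old = mem.lr(hart, addr)                    # stands in for lr.aq
        if interfere is not None and attempts == 1:
            interfere(mem)                          # another hart's store sneaks in
        if mem.sc(hart, addr, old + operand) == 0:  # stands in for sc.aqrl
            return old, attempts                    # otherwise loop, like bnez loop

mem = ToyMemory()
mem.words[0x100] = 5
print(emulated_amoadd(mem, 0, 0x100, 3))  # (5, 1)
print(emulated_amoadd(mem, 0, 0x100, 2,
                      interfere=lambda m: m.store(1, 0x100, 10)))  # (10, 2)
print(mem.words[0x100])  # 12
```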
These C11/C++11 mappings require the platform to provide the following
Physical Memory Attributes (as defined in the RISC-V Privileged ISA) for
all memory:


main memory


coherent


AMOArithmetic


RsrvEventual


Platforms with different attributes may require different mappings, or
require platform-specific SW (e.g., memory-mapped I/O).
Implementation Guidelines

The RVWMO and RVTSO memory models by no means preclude
microarchitectures from employing sophisticated speculation techniques
or other forms of optimization in order to deliver higher performance.
The models also do not impose any requirement to use any one particular
cache hierarchy, nor even to use a cache coherence protocol at all.
Instead, these models only specify the behaviors that can be exposed to
software. Microarchitectures are free to use any pipeline design, any
coherent or non-coherent cache hierarchy, any on-chip interconnect,
etc., as long as the design only admits executions that satisfy the
memory model rules. That said, to help people understand the actual
implementations of the memory model, in this section we provide some
guidelines on how architects and programmers should interpret the
models’ rules.
Both RVWMO and RVTSO are multi-copy atomic (or
“other-multi-copy-atomic”): any store value that is visible to a hart
other than the one that originally issued it must also be conceptually
visible to all other harts in the system. In other words, harts may
forward from their own previous stores before those stores have become
globally visible to all harts, but no early inter-hart forwarding is
permitted. Multi-copy atomicity may be enforced in a number of ways. It
might hold inherently due to the physical design of the caches and store
buffers, it may be enforced via a single-writer/multiple-reader cache
coherence protocol, or it might hold due to some other mechanism.
Although multi-copy atomicity does impose some restrictions on the
microarchitecture, it is one of the key properties keeping the memory
model from becoming extremely complicated. For example, a hart may not
legally forward a value from a neighbor hart’s private store buffer
(unless of course it is done in such a way that no new illegal behaviors
become architecturally visible). Nor may a cache coherence protocol
forward a value from one hart to another until the coherence protocol
has invalidated all older copies from other caches. Of course,
microarchitectures may (and high-performance implementations likely
will) violate these rules under the covers through speculation or other
optimizations, as long as any non-compliant behaviors are not exposed to
the programmer.
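The forwarding restriction can be pictured with a toy model. This is a minimal sketch, not a description of any real microarchitecture: the class `Hart`, its FIFO store buffer, and the dictionary standing in for shared memory are all assumptions made for illustration (RVWMO itself does not require buffered stores to drain in FIFO order).

```python
class Hart:
    """Toy hart with a private store buffer in front of shared memory."""

    def __init__(self, memory: dict):
        self.memory = memory      # shared by all harts
        self.store_buffer = []    # private to this hart, oldest first

    def store(self, addr: int, value: int) -> None:
        self.store_buffer.append((addr, value))

    def load(self, addr: int) -> int:
        # Forwarding from this hart's own buffered stores is allowed.
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        # Otherwise read shared memory; other harts' buffers are never consulted.
        return self.memory.get(addr, 0)

    def drain_one(self) -> None:
        # The oldest buffered store becomes visible to every hart at once.
        addr, value = self.store_buffer.pop(0)
        self.memory[addr] = value


memory = {}
hart0, hart1 = Hart(memory), Hart(memory)
hart0.store(0x10, 1)
print(hart0.load(0x10), hart1.load(0x10))  # 1 0
hart0.drain_one()
print(hart1.load(0x10))                    # 1
```

Because a store only leaves the private buffer by being written to the single shared memory, no hart other than the issuing one can observe it before all harts can.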
As a rough guideline for interpreting the PPO rules in RVWMO, we expect
the following from the software perspective:


programmers will use PPO rules [ppo:->st] and
[ppo:fence]–[ppo:pair] regularly and actively.


expert programmers will use PPO rules
[ppo:addr]–[ppo:ctrl] to speed up critical paths
of important data structures.


even expert programmers will rarely if ever use PPO rules
[ppo:rdw]–[ppo:amoforward] and
[ppo:addrdatarfi]–[ppo:addrpo] directly. These are
included to facilitate common microarchitectural optimizations
(rule [ppo:rdw]) and the operational formal
modeling approach (rules
[ppo:amoforward] and
[ppo:addrdatarfi]–[ppo:addrpo]) described in
Section [sec:operational]. They also
facilitate the process of porting code from other architectures that
have similar rules.


We also expect the following from the hardware perspective:
PPO rules [ppo:->st] and
[ppo:amoforward]–[ppo:release] reflect
well-understood rules that should pose few surprises to architects.


PPO rule [ppo:rdw] reflects a natural and common
hardware optimization, but one that is very subtle and hence is
worth double checking carefully.


PPO rule [ppo:rcsc] may not be immediately
obvious to architects, but it is a standard memory model requirement.


The load value axiom, the atomicity axiom, and PPO rules
[ppo:pair]–[ppo:addrpo] reflect rules that most
hardware implementations will enforce naturally, unless they contain
extreme optimizations. Of course, implementations should make sure
to double check these rules nevertheless. Hardware must also ensure
that syntactic dependencies are not “optimized away”.


Architectures are free to implement any of the memory model rules as
conservatively as they choose. For example, a hardware implementation
may choose to do any or all of the following:


interpret all fences as if they were FENCE RW,RW (or
FENCE IORW,IORW, if I/O is involved), regardless of the bits
actually set


implement all fences with PW and SR as if they were FENCE RW,RW (or
FENCE IORW,IORW, if I/O is involved), as PW with SR is the most
expensive of the four possible main memory ordering components
anyway


emulate aq and rl as described in
Section 1.5


enforce all same-address load-load ordering, even in the presence
of patterns such as “fri-rfi” and “RSW”


forbid any forwarding of a value from a store in the store buffer to
a subsequent AMO or LR to the same address


forbid any forwarding of a value from an AMO or SC in the store
buffer to a subsequent load to the same address


implement TSO on all memory accesses, and ignore any main memory
fences that do not include PW and SR ordering (e.g., as Ztso
implementations will do)


implement all atomics to be RCsc or even fully ordered, regardless
of annotation


Architectures that implement RVTSO can safely do the following:


Ignore all fences that do not have both PW and SR (unless the fence
also orders I/O)


Ignore all PPO rules except for rules
[ppo:fence] through
[ppo:rcsc], since the rest are
redundant with other PPO rules under RVTSO assumptions
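Two of the fence-handling options listed above can be sketched in Python. The function names and the string encoding of the predecessor and successor sets (for example "rw" or "iorw") are assumptions of this example; FENCE.TSO is not modeled.

```python
def conservative_fence(pred: str, succ: str) -> tuple[str, str]:
    """Strengthen any fence to RW,RW, or to IORW,IORW if it names I/O."""
    involves_io = any(c in "io" for c in (pred + succ).lower())
    full = "iorw" if involves_io else "rw"
    return full, full


def rvtso_may_ignore(pred: str, succ: str) -> bool:
    """True if an RVTSO implementation may treat this fence as a no-op."""
    pred, succ = pred.lower(), succ.lower()
    orders_io = any(c in "io" for c in pred + succ)
    has_pw_and_sr = "w" in pred and "r" in succ
    return not has_pw_and_sr and not orders_io


print(conservative_fence("r", "r"))   # ('rw', 'rw')
print(rvtso_may_ignore("r", "rw"))    # True: no PW bit
print(rvtso_may_ignore("rw", "rw"))   # False: has both PW and SR
```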


Other general notes:


Silent stores (i.e., stores that write the same value that already
exists at a memory location) behave like any other store from a
memory model point of view. Likewise, AMOs which do not actually
change the value in memory (e.g., an AMOMAX for which the value in
rs2 is smaller than the value currently in memory) are still
semantically considered store operations. Microarchitectures that
attempt to implement silent stores must take care to ensure that the
memory model is still obeyed, particularly in cases such as RSW
(Section 1.3.5) which tend to be
incompatible with silent stores.


Writes may be merged (i.e., two consecutive writes to the same
address may be merged) or subsumed (i.e., the earlier of two
back-to-back writes to the same address may be elided) as long as
the resulting behavior does not otherwise violate the memory model
semantics.


The question of write subsumption can be understood from the following
example:

|     | Hart 0      |     | Hart 1      |
|-----|-------------|-----|-------------|
|     | li t1, 3    |     | li t3, 2    |
|     | li t2, 1    |     |             |
| (a) | sw t1,0(s0) | (d) | lw a0,0(s1) |
| (b) | fence w, w  | (e) | sw a0,0(s0) |
| (c) | sw t2,0(s1) | (f) | sw t3,0(s0) |

As written, if the load (d) reads value 1, then (a) must precede (f) in
the global memory order:


(a) precedes (c) in the global memory order because of PPO rule 4
(the fence (b))


(c) precedes (d) in the global memory order because of the Load
Value axiom


(d) precedes (e) in the global memory order because of PPO rule 10
(the data dependency through a0)


(e) precedes (f) in the global memory order because of PPO rule 1
(a store to the same address)


In other words, the final value of the memory location whose address is
in s0 must be 2 (the value written by the store (f)) and cannot be 3
(the value written by the store (a)).
A very aggressive microarchitecture might erroneously decide to discard
(e), as (f) supersedes it, and this may in turn lead the
microarchitecture to break the now-eliminated dependency between (d) and
(f) (and hence also between (a) and (f)). This would violate the memory
model rules, and hence it is forbidden. Write subsumption may in other
cases be legal, if for example there were no data dependency between (d)
and (e).
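The chain of orderings in this example can be checked mechanically. The sketch below is only an illustration: the helper `reachable` and the single-letter event names are assumptions of the example, and the edge set simply transcribes the four orderings listed above.

```python
def reachable(edges: set[tuple[str, str]], src: str, dst: str) -> bool:
    """Depth-first search: is there a path src -> ... -> dst?"""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(b for (a, b) in edges if a == node)
    return False


# Global-memory-order constraints from the example when (d) reads 1.
constraints = {
    ("a", "c"),  # fence (b) between the two stores
    ("c", "d"),  # (d) reads the value written by (c)
    ("d", "e"),  # data dependency through a0
    ("e", "f"),  # same-address store after store
}

print(reachable(constraints, "a", "f"))                 # True
print(reachable(constraints - {("d", "e")}, "a", "f"))  # False
```

With the data dependency edge removed, nothing forces (a) before (f), which matches the remark that subsumption can be legal when that dependency is absent.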
Possible Future Extensions

We expect that any or all of the following possible future extensions
would be compatible with the RVWMO memory model:


‘V’ vector ISA extensions


‘J’ JIT extension


Native encodings for load and store opcodes with aq and rl set


Fences limited to certain addresses


Cache writeback/flush/invalidate/etc. instructions


Known Issues

Mixed-size RSW

 
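|     | Hart 0      |     | Hart 1                   |
|-----|-------------|-----|--------------------------|
|     | li t1, 1    |     | li t1, 1                 |
| (a) | lw a0,0(s0) | (d) | lw a1,0(s1)              |
| (b) | fence rw,rw | (e) | amoswap.w.rl a2,t1,0(s2) |
| (c) | sw t1,0(s1) | (f) | ld a3,0(s2)              |
|     |             | (g) | lw a4,4(s2)              |
|     |             |     | xor a5,a4,a4             |
|     |             |     | add s0,s0,a5             |
|     |             | (h) | sw a2,0(s0)              |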


Outcome: a0=1, a1=1, a2=0, a3=1, a4=0


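|     | Hart 0      |     | Hart 1       |
|-----|-------------|-----|--------------|
|     | li t1, 1    |     | li t1, 1     |
| (a) | lw a0,0(s0) | (d) | ld a1,0(s1)  |
| (b) | fence rw,rw | (e) | lw a2,4(s1)  |
| (c) | sw t1,0(s1) |     | xor a3,a2,a2 |
|     |             |     | add s0,s0,a3 |
|     |             | (f) | sw a2,0(s0)  |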


Outcome: a0=0, a1=1, a2=0


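|     | Hart 0      |     | Hart 1       |
|-----|-------------|-----|--------------|
|     | li t1, 1    |     | li t1, 1     |
| (a) | lw a0,0(s0) | (d) | sw t1,4(s1)  |
| (b) | fence rw,rw | (e) | ld a1,0(s1)  |
| (c) | sw t1,0(s1) | (f) | lw a2,4(s1)  |
|     |             |     | xor a3,a2,a2 |
|     |             |     | add s0,s0,a3 |
|     |             | (g) | sw a2,0(s0)  |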


Outcome: a0=1, a1=0x100000001, a2=1


There is a known discrepancy between the operational and axiomatic
specifications within the family of mixed-size RSW variants shown in
Figures [fig:litmus:discrepancy:rsw1]–[fig:litmus:discrepancy:rsw3].
To address this, we may choose to add something like the following new
PPO rule: Memory operation a precedes memory operation b in
preserved program order (and hence also in the global memory order) if
a precedes b in program order, a and b both access regular main
memory (rather than I/O regions), a is a load, b is a store, there
is a load m between a and b, there is a byte x that both a and
m read, there is no store between a and m that writes to x, and
m precedes b in PPO. In other words, in herd syntax, we may choose
to add “(po-loc | rsw);ppo;[W]” to PPO. Many implementations will
already enforce this ordering naturally. As such, even though this rule
is not official, we recommend that implementers enforce it nevertheless
in order to ensure forwards compatibility with the possible future
addition of this rule to RVWMO.
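To make the herd expression concrete, the following Python sketch evaluates “(po-loc | rsw);ppo;[W]” over relations represented as sets of pairs. The helper names are assumptions of the example, and so is the treatment of po-loc as relating any two program-ordered accesses whose footprints overlap, which is how the prose version of the rule is phrased. The sample relations loosely follow hart 1 of the second litmus test above: (d) and (e) are overlapping loads, and the store (f) depends on (e).

```python
Rel = set[tuple[str, str]]


def seq(r: Rel, s: Rel) -> Rel:
    """Relational composition r;s."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}


def ident(events: set[str]) -> Rel:
    """The identity relation [E] restricted to `events`."""
    return {(e, e) for e in events}


def proposed_ppo_edges(po_loc: Rel, rsw: Rel, ppo: Rel, writes: set[str]) -> Rel:
    """Edges contributed by (po-loc | rsw);ppo;[W]."""
    return seq(seq(po_loc | rsw, ppo), ident(writes))


po_loc = {("d", "e")}   # assumed: overlapping loads in program order
rsw = set()             # no read-same-write pairs in this small instance
ppo = {("e", "f")}      # dependency from load (e) to store (f)
print(proposed_ppo_edges(po_loc, rsw, ppo, writes={"f"}))  # {('d', 'f')}
```

The resulting edge orders the earlier load (d) before the store (f), which is the ordering the proposed rule is meant to add.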
Formal Memory Model Specifications, Version 0.1

To facilitate formal analysis of RVWMO, this chapter presents a set of
formalizations using different tools and modeling approaches. Any
discrepancies are unintended; the expectation is that the models
describe exactly the same sets of legal behaviors.
This appendix should be treated as commentary; all normative material is
provided in Chapter [ch:memorymodel] and in the rest of
the main body of the ISA specification. All currently known
discrepancies are listed in
Section 1.7. Any other
discrepancies are unintentional.
Formal Axiomatic Specification in Alloy

We present a formal specification of the RVWMO memory model in Alloy
(http://alloy.mit.edu). This model is available online at
https://github.com/daniellustig/riscv-memory-model.
The online material also contains some litmus tests and some examples of
how Alloy can be used to model check some of the mappings in
Section [sec:memory:porting].
  
////////////////////////////////////////////////////////////////////////////////
// =RVWMO PPO=

// Preserved Program Order
fun ppo : Event->Event {
  // same-address ordering
  po_loc :> Store
  + rdw
  + (AMO + StoreConditional) <: rfi

  // explicit synchronization
  + ppo_fence
  + Acquire <: ^po :> MemoryEvent
  + MemoryEvent <: ^po :> Release
  + RCsc <: ^po :> RCsc
  + pair

  // syntactic dependencies
  + addrdep
  + datadep
  + ctrldep :> Store

  // pipeline dependencies
  + (addrdep+datadep).rfi
  + addrdep.^po :> Store
}

// the global memory order respects preserved program order
fact { ppo in ^gmo }

  
////////////////////////////////////////////////////////////////////////////////
// =RVWMO axioms=

// Load Value Axiom
fun candidates[r: MemoryEvent] : set MemoryEvent {

  (r.~^gmo & Store & same_addr[r]) // writes preceding r in gmo
  + (r.^~po & Store & same_addr[r]) // writes preceding r in po
}
fun latest_among[s: set Event] : Event { s - s.~^gmo }

pred LoadValue {
  all w: Store | all r: Load |
    w->r in rf <=> w = latest_among[candidates[r]]
}

// Atomicity Axiom
pred Atomicity {
  all r: Store.~pair |            // starting from the lr,

    no x: Store & same_addr[r] |  // there is no store x to the same addr
      x not in same_hart[r]       // such that x is from a different hart,
      and x in r.~rf.^gmo         // x follows (the store r reads from) in gmo,
      and r.pair in x.^gmo        // and r follows x in gmo
}
// Progress Axiom implicit: Alloy only considers finite executions

pred RISCV_mm { LoadValue and Atomicity /* and Progress */ }

  
////////////////////////////////////////////////////////////////////////////////
// Basic model of memory

sig Hart {  // hardware thread
  start : one Event
}
sig Address {}
abstract sig Event {
  po: lone Event // program order
}

abstract sig MemoryEvent extends Event {
  address: one Address,
  acquireRCpc: lone MemoryEvent,
  acquireRCsc: lone MemoryEvent,
  releaseRCpc: lone MemoryEvent,
  releaseRCsc: lone MemoryEvent,
  addrdep: set MemoryEvent,
  ctrldep: set Event,
  datadep: set MemoryEvent,
  gmo: set MemoryEvent,  // global memory order
  rf: set MemoryEvent
}
sig LoadNormal extends MemoryEvent {} // l{b|h|w|d}
sig LoadReserve extends MemoryEvent { // lr
  pair: lone StoreConditional
}
sig StoreNormal extends MemoryEvent {}       // s{b|h|w|d}
// all StoreConditionals in the model are assumed to be successful
sig StoreConditional extends MemoryEvent {}  // sc
sig AMO extends MemoryEvent {}               // amo
sig NOP extends Event {}

fun Load : Event { LoadNormal + LoadReserve + AMO }
fun Store : Event { StoreNormal + StoreConditional + AMO }

sig Fence extends Event {
  pr: lone Fence, // opcode bit
  pw: lone Fence, // opcode bit
  sr: lone Fence, // opcode bit
  sw: lone Fence  // opcode bit
}
sig FenceTSO extends Fence {}

/* Alloy encoding detail: opcode bits are either set (encoded, e.g.,
 * as f.pr in iden) or unset (f.pr not in iden).  The bits cannot be used for
 * anything else */
fact { pr + pw + sr + sw in iden }
// likewise for ordering annotations
fact { acquireRCpc + acquireRCsc + releaseRCpc + releaseRCsc in iden }
// don't try to encode FenceTSO via pr/pw/sr/sw; just use it as-is
fact { no FenceTSO.(pr + pw + sr + sw) }

  
////////////////////////////////////////////////////////////////////////////////
// =Basic model rules=

// Ordering annotation groups
fun Acquire : MemoryEvent { MemoryEvent.acquireRCpc + MemoryEvent.acquireRCsc }
fun Release : MemoryEvent { MemoryEvent.releaseRCpc + MemoryEvent.releaseRCsc }
fun RCpc : MemoryEvent { MemoryEvent.acquireRCpc + MemoryEvent.releaseRCpc }
fun RCsc : MemoryEvent { MemoryEvent.acquireRCsc + MemoryEvent.releaseRCsc }

// There is no such thing as store-acquire or load-release, unless it's both

fact { Load & Release in Acquire }
fact { Store & Acquire in Release }
// FENCE PPO

fun FencePRSR : Fence { Fence.(pr & sr) }
fun FencePRSW : Fence { Fence.(pr & sw) }
fun FencePWSR : Fence { Fence.(pw & sr) }
fun FencePWSW : Fence { Fence.(pw & sw) }
fun ppo_fence : MemoryEvent->MemoryEvent {
    (Load  <: ^po :> FencePRSR).(^po :> Load)
  + (Load  <: ^po :> FencePRSW).(^po :> Store)
  + (Store <: ^po :> FencePWSR).(^po :> Load)
  + (Store <: ^po :> FencePWSW).(^po :> Store)
  + (Load  <: ^po :> FenceTSO) .(^po :> MemoryEvent)
  + (Store <: ^po :> FenceTSO) .(^po :> Store)
}

// auxiliary definitions

fun po_loc : Event->Event { ^po & address.~address }
fun same_hart[e: Event] : set Event { e + e.^~po + e.^po }
fun same_addr[e: Event] : set Event { e.address.~address }
// initial stores
fun NonInit : set Event { Hart.start.*po }
fun Init : set Event { Event - NonInit }
fact { Init in StoreNormal }

fact { Init->(MemoryEvent & NonInit) in ^gmo }
fact { all e: NonInit | one e.*~po.~start }  // each event is in exactly one hart
fact { all a: Address | one Init & a.~address } // one init store per address
fact { no Init <: po and no po :> Init }
  
// po
fact { acyclic[po] }

// gmo
fact { total[^gmo, MemoryEvent] } // gmo is a total order over all MemoryEvents

//rf
fact { rf.~rf in iden } // each read returns the value of only one write
fact { rf in Store <: address.~address :> Load }

fun rfi : MemoryEvent->MemoryEvent { rf & (*po + *~po) }
//dep
fact { no StoreNormal <: (addrdep + ctrldep + datadep) }
fact { addrdep + ctrldep + datadep + pair in ^po }
fact { datadep in datadep :> Store }
fact { ctrldep.*po in ctrldep }

fact { no pair & (^po :> (LoadReserve + StoreConditional)).^po }
fact { StoreConditional in LoadReserve.pair } // assume all SCs succeed
// rdw
fun rdw : Event->Event {
  (Load <: po_loc :> Load)  // start with all same_address load-load pairs,
  - (~rf.rf)                // subtract pairs that read from the same store,
  - (po_loc.rfi)            // and subtract out "fri-rfi" patterns
}

// filter out redundant instances and/or visualizations

fact { no gmo & gmo.gmo } // keep the visualization uncluttered
fact { all a: Address | some a.~address }
////////////////////////////////////////////////////////////////////////////////
// =Optional: opcode encoding restrictions=

// the list of blessed fences
fact { Fence in
  Fence.pr.sr
  + Fence.pw.sw
  + Fence.pr.pw.sw
  + Fence.pr.sr.sw
  + FenceTSO
  + Fence.pr.pw.sr.sw
}

pred restrict_to_current_encodings {

  no (LoadNormal + StoreNormal) & (Acquire + Release)
}
////////////////////////////////////////////////////////////////////////////////
// =Alloy shortcuts=

pred acyclic[rel: Event->Event] { no iden & ^rel }
pred total[rel: Event->Event, bag: Event] {
  all disj e, e': bag | e->e' in rel + ~rel
  acyclic[rel]
}
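For readers less familiar with Alloy's relational notation, the two shortcut predicates above can be mirrored in Python over relations stored as sets of pairs. This is an illustrative sketch only; the function names are assumptions of the example and nothing here is part of the Alloy model itself.

```python
def transitive_closure(rel):
    """Smallest transitive relation containing `rel` (Alloy's ^rel)."""
    closure = set(rel)
    while True:
        extra = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if extra <= closure:
            return closure
        closure |= extra


def acyclic(rel) -> bool:
    """Mirror of `no iden & ^rel`: no event reaches itself."""
    return all(a != b for (a, b) in transitive_closure(rel))


def total(rel, bag) -> bool:
    """Mirror of `total`: distinct members of bag are related one way or the other, and rel has no cycle."""
    related = set(rel) | {(b, a) for (a, b) in rel}
    return acyclic(rel) and all(
        (a, b) in related for a in bag for b in bag if a != b
    )


gmo = {(1, 2), (2, 3)}
print(total(gmo, {1, 2, 3}))                      # False: 1 and 3 are unrelated
print(total(transitive_closure(gmo), {1, 2, 3}))  # True
print(acyclic({(1, 2), (2, 1)}))                  # False
```

As in the Alloy fact on gmo, totality is checked on the transitive closure, since the stored relation may list only adjacent pairs.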
Formal Axiomatic Specification in Herd

The tool herd takes a memory model and a
litmus test as input and simulates the execution of the test on top of
the memory model. Memory models are written in the domain specific
language Cat. This section provides two
Cat memory models of RVWMO. The first
model, Figure [fig:herd2], follows the global memory
order definition of RVWMO given in Chapter [ch:memorymodel]
as closely as is possible for a Cat model. The second model,
Figure [fig:herd3], is an equivalent, more
efficient, partial-order-based RVWMO model.
The simulator herd is part of the diy tool suite — see http://diy.inria.fr for
software and documentation. The models and more are available online
at http://diy.inria.fr/cats7/riscv/.
  
(*************)
(* Utilities *)
(*************)

(* All fence relations *)
let fence.r.r = [R];fencerel(Fence.r.r);[R]
let fence.r.w = [R];fencerel(Fence.r.w);[W]
let fence.r.rw = [R];fencerel(Fence.r.rw);[M]
let fence.w.r = [W];fencerel(Fence.w.r);[R]
let fence.w.w = [W];fencerel(Fence.w.w);[W]
let fence.w.rw = [W];fencerel(Fence.w.rw);[M]
let fence.rw.r = [M];fencerel(Fence.rw.r);[R]
let fence.rw.w = [M];fencerel(Fence.rw.w);[W]
let fence.rw.rw = [M];fencerel(Fence.rw.rw);[M]
let fence.tso =
  let f = fencerel(Fence.tso) in
  ([W];f;[W]) | ([R];f;[M])

let fence = 
  fence.r.r | fence.r.w | fence.r.rw |
  fence.w.r | fence.w.w | fence.w.rw |
  fence.rw.r | fence.rw.w | fence.rw.rw |
  fence.tso

(* Same address, no W to the same address in-between *)
let po-loc-no-w = po-loc \ (po-loc?;[W];po-loc)
(* Read same write *)
let rsw = rf^-1;rf
(* Acquire, or stronger  *)
let AQ = Acq|AcqRel
(* Release or stronger *)
and RL = Rel|AcqRel
(* All RCsc *)
let RCsc = Acq|Rel|AcqRel
(* Amo events are both R and W, relation rmw relates paired lr/sc *)

let AMO = R & W
let StCond = range(rmw)
(*************)
(* ppo rules *)
(*************)

(* Overlapping-Address Orderings *)
let r1 = [M];po-loc;[W]
and r2 = ([R];po-loc-no-w;[R]) \ rsw
and r3 = [AMO|StCond];rfi;[R]
(* Explicit Synchronization *)
and r4 = fence
and r5 = [AQ];po;[M]
and r6 = [M];po;[RL]
and r7 = [RCsc];po;[RCsc]
and r8 = rmw
(* Syntactic Dependencies *)
and r9 = [M];addr;[M]
and r10 = [M];data;[W]
and r11 = [M];ctrl;[W]
(* Pipeline Dependencies *)
and r12 = [R];(addr|data);[W];rfi;[R]
and r13 = [R];addr;[M];po;[W]

let ppo = r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 | r9 | r10 | r11 | r12 | r13

  
Total

(* Notice that herd has defined its own rf relation *)

(* Define ppo *)
include "riscv-defs.cat"

(********************************)
(* Generate global memory order *)
(********************************)

let gmo0 = (* precursor: i.e., build gmo as a total order that includes gmo0 *)
  loc & (W\FW) * FW | # Final write after any write to the same location
  ppo |               # ppo compatible
  rfe                 # includes herd external rf (optimization)
(* Walk over all linear extensions of gmo0 *)
with  gmo from linearizations(M\IW,gmo0)

(* Add initial writes upfront -- convenient for computing rfGMO *)

let gmo = gmo | loc & IW * (M\IW)
(**********)
(* Axioms *)
(**********)

(* Compute rf according to the load value axiom, aka rfGMO *)

let WR = loc & ([W];(gmo|po);[R])
let rfGMO = WR \ (loc&([W];gmo);WR)
(* Check equality of herd rf and of rfGMO *)
empty (rf\rfGMO)|(rfGMO\rf) as RfCons

(* Atomicity axiom *)

let infloc = (gmo & loc)^-1
let inflocext = infloc & ext
let winside  = (infloc;rmw;inflocext) & (infloc;rf;rmw;inflocext) & [W]
empty winside as Atomic
  
Partial

(***************)
(* Definitions *)
(***************)

(* Define ppo *)
include "riscv-defs.cat"

(* Compute coherence relation *)
include "cos-opt.cat"

(**********)
(* Axioms *)
(**********)

(* Sc per location *)
acyclic co|rf|fr|po-loc as Coherence

(* Main model axiom *)
acyclic co|rfe|fr|ppo as Model

(* Atomicity axiom *)

empty rmw & (fre;coe) as Atomic
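The main axiom of the partial-order model can be tried out on a small example. The Python sketch below checks only "acyclic co|rfe|fr|ppo" (not the Coherence or Atomic checks) for a message-passing test; the event names, the hand-written relations, and the omission of initial writes from co are assumptions made to keep the example short.

```python
def has_cycle(edges) -> bool:
    """Detect a cycle in a directed graph given as a set of (src, dst) pairs."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    state = {}  # node -> "active" while on the DFS stack, "done" afterwards

    def visit(n) -> bool:
        if state.get(n) == "active":
            return True
        if state.get(n) == "done":
            return False
        state[n] = "active"
        found = any(visit(m) for m in succ.get(n, ()))
        state[n] = "done"
        return found

    return any(visit(n) for n in list(succ))


def model_allows(co, rfe, fr, ppo) -> bool:
    """Main model axiom: acyclic co|rfe|fr|ppo."""
    return not has_cycle(co | rfe | fr | ppo)


# Message passing: hart 0 does a: W x=1; b: W y=1,
# hart 1 does c: R y (reads 1); d: R x (reads the initial 0).
rfe = {("b", "c")}   # c reads from b
fr = {("d", "a")}    # d read the value that a later overwrites
co = set()           # initial writes omitted for brevity

fenced = {("a", "b"), ("c", "d")}         # ppo given fence w,w and fence r,r
print(model_allows(co, rfe, fr, fenced))  # False: outcome forbidden
print(model_allows(co, rfe, fr, set()))   # True: allowed without the fences
```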
An Operational Memory Model

This is an alternative presentation of the RVWMO memory model in
operational style. It aims to admit exactly the same extensional
behavior as the axiomatic presentation: for any given program, admitting
an execution if and only if the axiomatic presentation allows it.
The axiomatic presentation is defined as a predicate on complete
candidate executions. In contrast, this operational presentation has an
abstract microarchitectural flavor: it is expressed as a state machine,
with states that are an abstract representation of hardware machine
states, and with explicit out-of-order and speculative execution (but
abstracting from more implementation-specific microarchitectural details
such as register renaming, store buffers, cache hierarchies, cache
protocols, etc.). As such, it can provide useful intuition. It can also
construct executions incrementally, making it possible to interactively
and randomly explore the behavior of larger examples, while the
axiomatic model requires complete candidate executions over which the
axioms can be checked.
The operational presentation covers mixed-size execution, with
potentially overlapping memory accesses of different power-of-two byte
sizes. Misaligned accesses are broken up into single-byte accesses.
The operational model, together with a fragment of the RISC-V ISA
semantics (RV64I and A), are integrated into the rmem exploration tool
(https://github.com/rems-project/rmem). rmem can explore litmus
tests (see [sec:litmustests]) and small ELF
binaries exhaustively, pseudo-randomly and interactively. In rmem, the
ISA semantics is expressed explicitly in Sail (see
https://github.com/rems-project/sail for the Sail language, and
https://github.com/rems-project/sail-riscv for the RISC-V ISA model),
and the concurrency semantics is expressed in Lem (see
https://github.com/rems-project/lem for the Lem language).
rmem has a command-line interface and a web-interface. The
web-interface runs entirely on the client side, and is provided online
together with a library of litmus tests:
http://www.cl.cam.ac.uk/~pes20/rmem. The command-line interface is
faster than the web-interface, especially in exhaustive mode.
Below is an informal introduction to the model states and transitions.
The description of the formal model starts in the next subsection.
Terminology: In contrast to the axiomatic presentation, here every
memory operation is either a load or a store. Hence, AMOs give rise to
two distinct memory operations, a load and a store. When used in
conjunction with “instruction”, the terms “load” and “store” refer to
instructions that give rise to such memory operations. As such, both
include AMO instructions. The term “acquire” refers to an instruction
(or its memory operation) with the acquire-RCpc or acquire-RCsc
annotation. The term “release” refers to an instruction (or its memory
operation) with the release-RCpc or release-RCsc annotation.
Model states

A model state consists of a shared memory and a tuple of hart states.


Hart 0
…
Hart n


$\big\uparrow$ $\big\downarrow$


$\big\uparrow$ $\big\downarrow$


Shared Memory


The shared memory state records all the memory store operations that
have propagated so far, in the order they propagated (this can be made
more efficient, but for simplicity of the presentation we keep it this
way).
Each hart state consists principally of a tree of instruction instances,
some of which have been finished, and some of which have not.
Non-finished instruction instances can be subject to restart, e.g. if
they depend on an out-of-order or speculative load that turns out to be
unsound.
Conditional branch and indirect jump instructions may have multiple
successors in the instruction tree. When such an instruction is finished,
any un-taken alternative paths are discarded.
Each instruction instance in the instruction tree has a state that
includes an execution state of the intra-instruction semantics (the ISA
pseudocode for this instruction). The model uses a formalization of the
intra-instruction semantics in Sail. One can think of the execution
state of an instruction as a representation of the pseudocode control
state, pseudocode call stack, and local variable values. An instruction
instance state also includes information about the instance’s memory and
register footprints, its register reads and writes, its memory
operations, whether it is finished, etc.
Model transitions

The model defines, for any model state, the set of allowed transitions,
each of which is a single atomic step to a new abstract machine state.
Execution of a single instruction will typically involve many
transitions, and they may be interleaved in operational-model execution
with transitions arising from other instructions. Each transition arises
from a single instruction instance; it will change the state of that
instance, and it may depend on or change the rest of its hart state and
the shared memory state, but it does not depend on other hart states,
and it will not change them. The transitions are introduced below and
defined in
Section 1.5, with a precondition and a
construction of the post-transition model state for each.
Transitions for all instructions:


: This transition represents a fetch and decode of a new instruction
instance, as a program order successor of a previously fetched
instruction instance (or the initial fetch address).
The model assumes the instruction memory is fixed; it does not
describe the behavior of self-modifying code. In particular, the
transition does not generate memory load operations, and the shared
memory is not involved in the transition. Instead, the model depends
on an external oracle that provides an opcode when given a memory
location.


: This is a write of a register value.


: This is a read of a register value from the most recent
program-order-predecessor instruction instance that writes to that
register.


: This covers pseudocode internal computation: arithmetic, function
calls, etc.


: At this point the instruction pseudocode is done, the instruction
cannot be restarted, memory accesses cannot be discarded, and all
memory effects have taken place. For conditional branch and indirect
jump instructions, any program order successors that were fetched
from an address that is not the one that was written to the pc
register are discarded, together with the sub-tree of instruction
instances below them.


Transitions specific to load instructions:


: At this point the memory footprint of the load instruction is
provisionally known (it could change if earlier instructions are
restarted) and its individual memory load operations can start being
satisfied.


: This partially or entirely satisfies a single memory load
operation by forwarding, from program-order-previous memory store
operations.


: This entirely satisfies the outstanding slices of a single memory
load operation, from memory.


: At this point all the memory load operations of the instruction
have been entirely satisfied and the instruction pseudocode can
continue executing. A load instruction can be subject to being
restarted until the transition. But, under some conditions, the
model might treat a load instruction as non-restartable even before
it is finished (e.g. see ).


Transitions specific to store instructions:


: At this point the memory footprint of the store is provisionally
known.


: At this point the memory store operations have their values and
program-order-successor memory load operations can be satisfied by
forwarding from them.


: At this point the store operations are guaranteed to happen (the
instruction can no longer be restarted or discarded), and they can
start being propagated to memory.


: This propagates a single memory store operation to memory.


: At this point all the memory store operations of the instruction
have been propagated to memory, and the instruction pseudocode can
continue executing.


Transitions specific to sc instructions:


: This causes the sc to fail, either a spontaneous fail or because
it is not paired with a program-order-previous lr.


: This transition indicates the sc is paired with an lr and
might succeed.


: This is an atomic execution of the transitions and , it is enabled
only if the stores from which the lr read have not been
overwritten.


: This causes the sc to fail, either a spontaneous fail or because
the stores from which the lr read have been overwritten.


Transitions specific to AMO instructions:

: This is an atomic execution of all the transitions needed to
satisfy the load operation, do the required arithmetic, and
propagate the store operation.

Transitions specific to fence instructions:


The transitions labeled ∘ can always be taken eagerly, as soon as their
precondition is satisfied, without excluding other behavior; the •
cannot. Although is marked with a •, it can be taken eagerly as long as
it is not taken infinitely many times.
An instance of a non-AMO load instruction, after being fetched, will
typically experience the following transitions in this order:


and/or (as many as needed to satisfy all the load operations of the
instance)


Before, between and after the transitions above, any number of
transitions may appear. In addition, a transition for fetching the
instruction in the next program location will be available until it is
taken.
This concludes the informal description of the operational model. The
following sections describe the formal operational model.
Intra-instruction Pseudocode Execution

The intra-instruction semantics for each instruction instance is
expressed as a state machine, essentially running the instruction
pseudocode. Given a pseudocode execution state, it computes the next
state. Most states identify a pending memory or register operation,
requested by the pseudocode, which the memory model has to do. The
states are (this is a tagged union; tags in small-caps):


Load_mem(kind, address, size, load_continuation)
memory load operation


Early_sc_fail(res_continuation)
allow sc to fail early


Store_ea(kind, address, size, next_state)
memory store effective address


Store_memv(mem_value, store_continuation)
memory store value


Fence(kind, next_state)
fence


Read_reg(reg_name, read_continuation)
register read


Write_reg(reg_name, reg_value, next_state)
register write


Internal(next_state)
pseudocode internal step


Done
end of pseudocode


Here:

mem_value and reg_value are lists of bytes;
address is an integer of XLEN bits;
for load/store, kind identifies whether it is lr/sc,
acquire-RCpc/release-RCpc, acquire-RCsc/release-RCsc,
acquire-release-RCsc;
for fence, kind identifies whether it is a normal or TSO fence, and (for
normal fences) the predecessor and successor ordering bits;
reg_name identifies a register and a slice thereof (start and end bit
indices); and
the continuations describe how the instruction instance will continue
for each value that might be provided by the surrounding memory model
(the load_continuation and read_continuation take the value loaded
from memory and read from the previous register write, the
store_continuation takes false for an sc that failed and true in
all other cases, and res_continuation takes false if the sc fails
and true otherwise).


For example, given the load instruction lw x1,0(x2), an execution will
typically go as follows. The initial execution state will be computed
from the pseudocode for the given opcode. This can be expected to be
Read_reg(x2, read_continuation). Feeding the most recently written
value of register x2 (the instruction semantics will be blocked if
necessary until the register value is available), say 0x4000, to
read_continuation returns Load_mem(plain_load, 0x4000, 4,
load_continuation). Feeding the 4-byte value loaded from memory
location 0x4000, say 0x42, to load_continuation returns
Write_reg(x1, 0x42, Done). Many Internal(next_state) states may
appear before and between the states above.
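The walk-through above can be sketched as a small state machine in Python. This is only an illustration: the class names mirror four of the tags listed earlier, while `lw`, `run`, the word-granular `memory` dictionary and the omission of Internal steps are simplifying assumptions, not part of the model or of the rmem tool.

```python
from dataclasses import dataclass
from typing import Callable, Union


# Tagged union of pseudocode execution states (only the tags a load needs).
@dataclass
class ReadReg:
    reg_name: str
    read_continuation: Callable[[int], "State"]


@dataclass
class LoadMem:
    kind: str
    address: int
    size: int
    load_continuation: Callable[[int], "State"]


@dataclass
class WriteReg:
    reg_name: str
    reg_value: int
    next_state: "State"


@dataclass
class Done:
    pass


State = Union[ReadReg, LoadMem, WriteReg, Done]


def lw(rd: str, offset: int, rs1: str) -> State:
    """Initial state for `lw rd, offset(rs1)` (hypothetical helper)."""
    return ReadReg(
        rs1,
        lambda base: LoadMem(
            "plain_load", base + offset, 4,
            lambda value: WriteReg(rd, value, Done()),
        ),
    )


def run(state: State, regs: dict, memory: dict) -> list:
    """Feed values to the continuations, standing in for the memory model."""
    trace = []
    while not isinstance(state, Done):
        trace.append(type(state).__name__)
        if isinstance(state, ReadReg):
            state = state.read_continuation(regs[state.reg_name])
        elif isinstance(state, LoadMem):
            state = state.load_continuation(memory[state.address])
        elif isinstance(state, WriteReg):
            regs[state.reg_name] = state.reg_value
            state = state.next_state
    trace.append("Done")
    return trace


regs = {"x1": 0, "x2": 0x4000}
memory = {0x4000: 0x42}  # assumption: one 4-byte value per address key
trace = run(lw("x1", 0, "x2"), regs, memory)
```

After the run, `trace` lists the visited tags in order and `regs["x1"]` holds the loaded value. Representing continuations as closures keeps the instruction semantics independent of how the surrounding model produces the values.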

Notice that writing to memory is split into two steps, Store_ea and
Store_memv: the first one makes the memory footprint of the store
provisionally known, and the second one adds the value to be stored. We
ensure these are paired in the pseudocode (Store_ea followed by
Store_memv), but there may be other steps between them.

It is observable that the Store_ea can occur before the value to be
stored is determined. For example, for the litmus test
LB+fence.r.rw+data-po to be allowed by the operational model (as it is
by RVWMO), the first store in Hart 1 has to take the Store_ea step
before its value is determined, so that the second store can see it is
to a non-overlapping memory footprint, allowing the second store to be
committed out of order without violating coherence.

The pseudocode of each instruction performs at most one store or one
load, except for AMOs that perform exactly one load and one store. Those
memory accesses are then split apart into the architecturally atomic
units by the hart semantics (see Initiate memory load operations and
Initiate memory store operation footprints below).

Informally, each bit of a register read should be satisfied from a
register write by the most recent (in program order) instruction
instance that can write that bit (or from the hart’s initial register
state if there is no such write). Hence, it is essential to know the
register write footprint of each instruction instance, which we
calculate when the instruction instance is created (see the action of
Fetch instruction below). We ensure in the pseudocode that each
instruction does at most one register write to each register bit, and
also that it does not try to read a register value it just wrote.

Data-flow dependencies (address and data) in the model emerge from the
fact that each register read has to wait for the appropriate register
write to be executed (as described above).

Instruction Instance State

Each instruction instance i has a state comprising:


program_loc, the memory address from which the instruction was
fetched;


instruction_kind, identifying whether this is a load, store, AMO,
fence, branch/jump or a ‘simple’ instruction (this also includes a
kind similar to the one described for the pseudocode execution
states);


src_regs, the set of source reg_names (including system
registers), as statically determined from the pseudocode of the
instruction;


dst_regs, the destination reg_names (including system
registers), as statically determined from the pseudocode of the
instruction;


pseudocode_state (or sometimes just ‘state’ for short), one of
(this is a tagged union; tags in small-caps):


Plain(isa_state)
ready to make a pseudocode transition


Pending_mem_loads(load_continuation)
requesting memory load operation(s)


Pending_mem_stores(store_continuation)
requesting memory store operation(s)


reg_reads, the register reads the instance has performed,
including, for each one, the register write slices it read from;


reg_writes, the register writes the instance has performed;


mem_loads, a set of memory load operations, and for each one the
as-yet-unsatisfied slices (the byte indices that have not been
satisfied yet), and, for the satisfied slices, the store slices
(each consisting of a memory store operation and subset of its byte
indices) that satisfied it.


mem_stores, a set of memory store operations, and for each one a
flag that indicates whether it has been propagated (passed to the
shared memory) or not.


information recording whether the instance is committed, finished,
etc.


Each memory load operation includes a memory footprint (address and
size). Each memory store operation includes a memory footprint, and,
when available, a value.

A load instruction instance with a non-empty mem_loads, for which all
the load operations are satisfied (i.e. there are no unsatisfied load
slices) is said to be entirely satisfied.

Informally, an instruction instance is said to have fully determined
data if the load (and sc) instructions feeding its source registers
are finished. Similarly, it is said to have a fully determined memory
footprint if the load (and sc) instructions feeding its memory
operation address register are finished. Formally, we first define the
notion of fully determined register write: a register write w from
reg_writes of instruction instance i is said to be fully
determined if one of the following conditions holds:


i is finished; or


the value written by w is not affected by a memory operation that
i has made (i.e. a value loaded from memory or the result of
sc), and, for every register read that i has made, that affects
w, the register write from which i read is fully determined (or
i read from the initial register state).


Now, an instruction instance i is said to have fully determined data
if for every register read r from reg_reads, the register writes
that r reads from are fully determined. An instruction instance i is
said to have a fully determined memory footprint if for every register
read r from reg_reads that feeds into i’s memory operation
address, the register writes that r reads from are fully determined.
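The recursive definition of a fully determined register write can be sketched as follows. The `RegWrite` record and its field names are hypothetical; in particular, `sources` is assumed to hold only the register writes that the writing instance read from and that affect this write, with reads from the initial register state contributing nothing.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RegWrite:
    instance_finished: bool   # is the writing instance i finished?
    from_memory: bool         # value affected by a load or sc result of i?
    sources: List["RegWrite"] = field(default_factory=list)


def fully_determined(w: RegWrite) -> bool:
    """Apply the two conditions of the definition, recursing through sources."""
    if w.instance_finished:
        return True
    if w.from_memory:
        return False
    # An empty `sources` list models reads from the initial register state.
    return all(fully_determined(src) for src in w.sources)


load_write = RegWrite(instance_finished=False, from_memory=True)
add_write = RegWrite(instance_finished=False, from_memory=False,
                     sources=[load_write])
before = fully_determined(add_write)   # the feeding load is not finished yet
load_write.instance_finished = True
after = fully_determined(add_write)    # now determined via the finished load
```

The recursion terminates because a register write can only depend on writes from program-order-earlier instances.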

The rmem tool records, for every register write, the set of register
writes from other instructions that have been read by this instruction
at the point of performing the write. By carefully arranging the
pseudocode of the instructions covered by the tool we were able to make
it so that this is exactly the set of register writes on which the write
depends.

Hart State

The model state of a single hart comprises:


hart_id, a unique identifier of the hart;


initial_register_state, the initial register value for each
register;


initial_fetch_address, the initial instruction fetch address;


instruction_tree, a tree of the instruction instances that have
been fetched (and not discarded), in program order.


Shared Memory State

The model state of the shared memory comprises a list of memory store
operations, in the order they propagated to the shared memory.
When a store operation is propagated to the shared memory it is simply
added to the end of the list. When a load operation is satisfied from
memory, for each byte of the load operation, the most recent
corresponding store slice is returned.
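A minimal sketch of this list-based shared memory, assuming a hypothetical `Store` record and returning `None` for bytes that no store has written (the text does not say how initial memory is represented):

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Store:
    hart: int
    address: int
    data: bytes


class SharedMemory:
    def __init__(self) -> None:
        self.stores: List[Store] = []   # kept in propagation order

    def propagate(self, store: Store) -> None:
        """Propagating a store appends it to the end of the list."""
        self.stores.append(store)

    def satisfy_load(self, address: int,
                     size: int) -> List[Optional[Tuple[Store, int]]]:
        """Per byte, return (store, byte index) of the most recent covering store."""
        result = []
        for addr in range(address, address + size):
            found = None
            for store in reversed(self.stores):
                if store.address <= addr < store.address + len(store.data):
                    found = (store, addr - store.address)
                    break
            result.append(found)
        return result


mem = SharedMemory()
mem.propagate(Store(hart=0, address=0x100, data=bytes([1, 2, 3, 4])))
mem.propagate(Store(hart=1, address=0x102, data=bytes([9])))
slices = mem.satisfy_load(0x100, 4)
loaded = bytes(store.data[i] for store, i in slices)
```

Here the third byte of the load comes from the later one-byte store, while the other three come from the earlier four-byte store.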

For most purposes, it is simpler to think of the shared memory as an
array, i.e., a map from memory locations to memory store operation
slices, where each memory location is mapped to a one-byte slice of the
most recent memory store operation to that location. However, this
abstraction is not detailed enough to properly handle the sc
instruction. RVWMO allows store operations from the same hart as the
sc to intervene between the store operation of the sc and the store
operations the paired lr read from. To allow such store operations to
intervene, and forbid others, the array abstraction must be extended to
record more information. Here, we use a list as it is very simple, but a
more efficient and scalable implementation should probably use
something better.

Transitions

Each of the paragraphs below describes a single kind of system
transition. The description starts with a condition over the current
system state. The transition can be taken in the current state only if
the condition is satisfied. The condition is followed by an action that
is applied to that state when the transition is taken, in order to
generate the new system state.

Fetch instruction

A possible program-order-successor of instruction instance i can be
fetched from address loc if:


it has not already been fetched, i.e., none of the immediate
successors of i in the hart’s instruction_tree are from loc;
and


if i’s pseudocode has already written an address to pc, then
loc must be that address, otherwise loc is:


for a conditional branch, the successor address or the branch
target address;


for a (direct) jump and link instruction (jal), the target
address;


for an indirect jump instruction (jalr), any address; and


for any other instruction, i.program_loc + 4.


Action: construct a freshly initialized instruction instance i′ for
the instruction in the program memory at loc, with state
Plain(isa_state), computed from the instruction pseudocode, including
the static information available from the pseudocode such as its
instruction_kind, src_regs, and dst_regs, and add i′ to the
hart’s instruction_tree as a successor of i.
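The condition on loc can be sketched as a function that returns the set of addresses from which a successor may still be fetched. The string-valued `kind`, the parameter names, and the finite `jalr_candidates` set standing in for "any address" are assumptions made for illustration only.

```python
from typing import Iterable, Optional, Set


def possible_fetch_addresses(
    kind: str,                          # "branch", "jal", "jalr" or anything else
    program_loc: int,
    written_pc: Optional[int] = None,   # address already written to pc, if any
    target: Optional[int] = None,       # branch or jal target address
    jalr_candidates: Iterable[int] = (),
    already_fetched: Iterable[int] = (),
) -> Set[int]:
    if written_pc is not None:
        addresses = {written_pc}
    elif kind == "branch":
        addresses = {program_loc + 4, target}
    elif kind == "jal":
        addresses = {target}
    elif kind == "jalr":
        addresses = set(jalr_candidates)
    else:
        addresses = {program_loc + 4}
    # A successor can be fetched from a given address only once.
    return addresses - set(already_fetched)


both_paths = possible_fetch_addresses("branch", 0x100, target=0x80)
remaining = possible_fetch_addresses("branch", 0x100, target=0x80,
                                     already_fetched=[0x104])
```

For a conditional branch both the fall-through and the target are offered until each has been fetched, which is what permits speculation down either path.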

The possible next fetch addresses (loc) are available immediately
after fetching i and the model does not need to wait for the
pseudocode to write to pc; this allows out-of-order execution, and
speculation past conditional branches and jumps. For most instructions
these addresses are easily obtained from the instruction pseudocode. The
only exception to that is the indirect jump instruction (jalr), where
the address depends on the value held in a register. In principle the
mathematical model should allow speculation to arbitrary addresses here.
The exhaustive search in the rmem tool handles this by running the
exhaustive search multiple times with a growing set of possible next
fetch addresses for each indirect jump. The initial search uses empty
sets, hence there is no fetch after an indirect jump instruction until
the pseudocode of the instruction writes to pc, and then we use that
value for fetching the next instruction. Before starting the next iteration of
exhaustive search, we collect for each indirect jump (grouped by code
location) the set of values it wrote to pc in all the executions in
the previous search iteration, and use that as possible next fetch
addresses of the instruction. This process terminates when no new fetch
addresses are detected.
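The iteration described above is a fixed-point computation. The following sketch shows only its shape: `search` is a hypothetical stand-in for one exhaustive search, taking the current candidate sets and returning the pc values each indirect jump wrote, and `toy_search` is an invented example, not the behavior of the rmem tool.

```python
from typing import Callable, Dict, Set

Candidates = Dict[int, Set[int]]   # jalr code location -> next fetch addresses


def iterate_to_fixed_point(search: Callable[[Candidates], Candidates]) -> Candidates:
    candidates: Candidates = {}    # the initial search uses empty sets
    while True:
        observed = search(candidates)
        grown = {loc: set(addrs) for loc, addrs in candidates.items()}
        for loc, pcs in observed.items():
            grown.setdefault(loc, set()).update(pcs)
        if grown == candidates:    # no new fetch addresses were detected
            return candidates
        candidates = grown


def toy_search(candidates: Candidates) -> Candidates:
    """Invented example: speculating to 0x20 reveals a further target 0x30."""
    written = {0x10: {0x20}}
    if 0x20 in candidates.get(0x10, set()):
        written[0x10].add(0x30)
    return written


result = iterate_to_fixed_point(toy_search)
```

In this toy run the candidate set for the jump at 0x10 grows over two rounds and the third round detects nothing new, so the loop stops.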

Initiate memory load operations

An instruction instance i in state Plain(Load_mem(kind, address,
size, load_continuation)) can always initiate the corresponding
memory load operations. Action:


construct the appropriate memory load operations mlos:


if address is aligned to size then mlos is a single
memory load operation of size bytes from address;


otherwise, mlos is a set of size memory load
operations, each of one byte, from the addresses
address…address + size − 1.


set mem_loads of i to mlos; and


update the state of i to Pending_mem_loads(load_continuation).


In Section [sec:rvwmo:primitives] it is
said that misaligned memory accesses may be decomposed at any
granularity. Here we decompose them to one-byte accesses as this
granularity subsumes all others.
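The decomposition rule is small enough to state directly in code. In this sketch a memory operation is represented as a hypothetical `(address, size)` pair; the same rule is used later for store footprints.

```python
from typing import List, Tuple


def memory_operations(address: int, size: int) -> List[Tuple[int, int]]:
    """Return the (address, size) footprints of the operations to initiate."""
    if address % size == 0:
        # Aligned access: a single operation covering all `size` bytes.
        return [(address, size)]
    # Misaligned access: one single-byte operation per byte.
    return [(address + i, 1) for i in range(size)]


aligned = memory_operations(0x4000, 4)
misaligned = memory_operations(0x4001, 4)
```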

Satisfy memory load operation by forwarding from unpropagated stores

For a non-AMO load instruction instance i in state
Pending_mem_loads(load_continuation), and a memory load operation
mlo in i.mem_loads that has unsatisfied slices, the memory
load operation can be partially or entirely satisfied by forwarding from
unpropagated memory store operations by store instruction instances that
are program-order-before i if:


all program-order-previous fence instructions with .sr and .pw
set are finished;


for every program-order-previous fence instruction, f, with
.sr and .pr set, and .pw not set, if f is not finished then
all load instructions that are program-order-before f are entirely
satisfied;


for every program-order-previous fence.tso instruction, f, that
is not finished, all load instructions that are program-order-before
f are entirely satisfied;


if i is a load-acquire-RCsc, all program-order-previous
store-releases-RCsc are finished;


if i is a load-acquire-release, all program-order-previous
instructions are finished;


all non-finished program-order-previous load-acquire instructions
are entirely satisfied; and


all program-order-previous store-acquire-release instructions are
finished;


Let msoss be the set of all unpropagated memory store
operation slices from non-sc store instruction instances that are
program-order-before i and have already calculated the value to be
stored, that overlap with the unsatisfied slices of mlo, and which
are not superseded by intervening store operations or store operations
that are read from by an intervening load. The last condition requires,
for each memory store operation slice msos in msoss
from instruction i′:

that there is no store instruction program-order-between i and i′
with a memory store operation overlapping msos; and
that there is no load instruction program-order-between i and i′
that was satisfied from an overlapping memory store operation slice from
a different hart.

Action:


update i.mem_loads to indicate that mlo was satisfied by
msoss; and


restart any speculative instructions which have violated coherence
as a result of this, i.e., for every non-finished instruction i′
that is a program-order-successor of i, and every memory load
operation mlo′ of i′ that was satisfied from
msoss′, if there exists a memory store operation slice
msos′ in msoss′, and an overlapping memory store
operation slice from a different memory store operation in
msoss, and msos′ is not from an instruction that
is a program-order-successor of i, restart i′ and its
restart-dependents.


Here, the restart-dependents of instruction j are:

program-order-successors of j that have data-flow dependency on a
register write of j;
program-order-successors of j that have a memory load operation that
reads from a memory store operation of j (by forwarding);
if j is a load-acquire, all the program-order-successors of j;
if j is a load, for every fence, f, with .sr and .pr set, and
.pw not set, that is a program-order-successor of j, all the load
instructions that are program-order-successors of f;
if j is a load, for every fence.tso, f, that is a
program-order-successor of j, all the load instructions that are
program-order-successors of f; and
(recursively) all the restart-dependents of all the instruction
instances above.
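Because the last clause is recursive, the restart-dependents form a transitive closure. The sketch below assumes a hypothetical `direct` mapping that already encodes the non-recursive clauses (data-flow, forwarding, load-acquire and fence successors) and computes only the closure.

```python
from typing import Dict, Hashable, Set


def restart_dependents(j: Hashable,
                       direct: Dict[Hashable, Set[Hashable]]) -> Set[Hashable]:
    """Collect every instance reachable from j through `direct`."""
    seen: Set[Hashable] = set()
    work = list(direct.get(j, ()))
    while work:
        k = work.pop()
        if k not in seen:
            seen.add(k)
            work.extend(direct.get(k, ()))
    return seen


# Invented example: a depends directly on j, b on a; d is unrelated to j.
direct = {"j": {"a"}, "a": {"b"}, "c": {"d"}}
dependents = restart_dependents("j", direct)
```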


Forwarding memory store operations to a memory load might satisfy only
some slices of the load, leaving other slices unsatisfied.

A program-order-previous store operation that was not available when
taking the transition above might make msoss provisionally
unsound (violating coherence) when it becomes available. That store will
prevent the load from being finished (see Finish instruction), and will
cause it to restart when that store operation is propagated (see
Propagate store operation).

A consequence of the transition condition above is that
store-release-RCsc memory store operations cannot be forwarded to
load-acquire-RCsc instructions: msoss does not include memory
store operations from finished stores (as those must be propagated
memory store operations), and the condition above requires all
program-order-previous store-releases-RCsc to be finished when the load
is acquire-RCsc.

Satisfy memory load operation from memory

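For an instruction instance i of a non-AMO load instruction or an AMO
instruction in the context of the “Satisfy, commit and propagate
operations of an AMO” transition, any memory load operation mlo in
i.mem_loads that has unsatisfied slices can be satisfied from memory if
all the conditions of Satisfy memory load operation by forwarding from
unpropagated stores are satisfied. Action: let msoss be the memory store
operation slices from memory covering the unsatisfied slices of mlo, and
apply the action of Satisfy memory load operation by forwarding from
unpropagated stores.

Note that Satisfy memory load operation by forwarding from unpropagated
stores might leave some slices of the memory load operation
unsatisfied; those will have to be satisfied by taking the transition
again, or by taking Satisfy memory load operation from memory. Satisfy
memory load operation from memory, on the other hand, will always
satisfy all the unsatisfied slices of the memory load operation.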

Complete load operations

A load instruction instance i in state
Pending_mem_loads(load_continuation) can be completed (not to be
confused with finished) if all the memory load operations
i.mem_loads are entirely satisfied (i.e. there are no unsatisfied
slices). Action: update the state of i to
Plain(load_continuation(mem_value)), where mem_value is assembled
from all the memory store operation slices that satisfied
i.mem_loads.

Early sc fail

An sc instruction instance i in state
Plain(Early_sc_fail(res_continuation)) can always be made to fail.
Action: update the state of i to Plain(res_continuation(false)).

Paired sc

An sc instruction instance i in state
Plain(Early_sc_fail(res_continuation)) can continue its (potentially
successful) execution if i is paired with an lr. Action: update the
state of i to Plain(res_continuation(true)).

Initiate memory store operation footprints

An instruction instance i in state Plain(Store_ea(kind, address,
size, next_state)) can always announce its pending memory store
operation footprint. Action:


construct the appropriate memory store operations msos
(without the store value):


if address is aligned to size then msos is a single
memory store operation of size bytes to address;


otherwise, msos is a set of size memory store
operations, each of one-byte size, to the addresses
address…address + size − 1.


set i.mem_stores to msos; and


update the state of i to Plain(next_state).


Note that after taking the transition above the memory store operations
do not yet have their values. The importance of splitting this
transition from the transition below is that it allows other
program-order-successor store instructions to observe the memory
footprint of this instruction, and if they don’t overlap, propagate out
of order as early as possible (i.e. before the data register value
becomes available).

Instantiate memory store operation values

An instruction instance i in state Plain(Store_memv(mem_value,
store_continuation)) can always instantiate the values of the memory
store operations i.mem_stores. Action:


split mem_value between the memory store operations
i.mem_stores; and


update the state of i to Pending_mem_stores(store_continuation).


Commit store instruction

An uncommitted instruction instance i of a non-sc store instruction
or an sc instruction in the context of the “Commit and propagate store
operation of an sc” transition, in state
Pending_mem_stores(store_continuation), can be committed (not to be
confused with propagated) if:


i has fully determined data;


all program-order-previous conditional branch and indirect jump
instructions are finished;


all program-order-previous fence instructions with .sw set are
finished;


all program-order-previous fence.tso instructions are finished;


all program-order-previous load-acquire instructions are finished;


all program-order-previous store-acquire-release instructions are
finished;


if i is a store-release, all program-order-previous instructions
are finished;


all program-order-previous memory access instructions have a fully
determined memory footprint;


all program-order-previous store instructions, except for sc that
failed, have initiated and so have non-empty mem_stores; and


all program-order-previous load instructions have initiated and so have
non-empty mem_loads.


Action: record that i is committed.

Notice that if the condition requiring all program-order-previous memory
access instructions to have a fully determined memory footprint is
satisfied, the two conditions that follow it (that program-order-previous
store and load instructions have initiated) are also satisfied, or will
be satisfied after taking some eager transitions. Hence, requiring them
does not strengthen the model. By requiring them, we guarantee that
previous memory access instructions have taken enough transitions to
make their memory operations visible for the condition check of
Propagate store operation, which is the next transition the instruction
will take, making that condition simpler.

Propagate store operation

For a committed instruction instance i in state
Pending_mem_stores(store_continuation), and an unpropagated memory
store operation mso in i.mem_stores, mso can be
propagated if:


all memory store operations of program-order-previous store
instructions that overlap with mso have already propagated;


all memory load operations of program-order-previous load
instructions that overlap with mso have already been
satisfied, and (the load instructions) are non-restartable (see
definition below); and


all memory load operations that were satisfied by forwarding
mso are entirely satisfied.


Here, a non-finished instruction instance j is non-restartable if:


there does not exist a store instruction s and an unpropagated
memory store operation mso of s such that applying the
action of the “Propagate store operation” transition to mso will
result in the restart of j; and


there does not exist a non-finished load instruction l and a
memory load operation mlo of l such that applying the action
of the “Satisfy memory load operation by forwarding from unpropagated
stores”/“Satisfy memory load operation from memory” transition (even
if mlo is already satisfied) to mlo will result in the restart of j.


Action:


update the shared memory state with mso;


update i.mem_stores to indicate that mso was propagated;
and


restart any speculative instructions which have violated coherence
as a result of this, i.e., for every non-finished instruction i′
program-order-after i and every memory load operation mlo′
of i′ that was satisfied from msoss′, if there exists a
memory store operation slice msos′ in msoss′ that
overlaps with mso and is not from mso, and msos′
is not from a program-order-successor of i, restart i′ and its
restart-dependents (see Satisfy memory load operation by forwarding
from unpropagated stores).


Commit and propagate store operation of an sc

An uncommitted sc instruction instance i, from hart h, in state
Pending_mem_stores(store_continuation), with a paired lr i′ that
has been satisfied by some store slices msoss, can be
committed and propagated at the same time if:


i′ is finished;


every memory store operation that has been forwarded to i′ is
propagated;


the conditions of Commit store instruction are satisfied;


the conditions of Propagate store operation are satisfied (notice that
an sc instruction can only have one memory store operation); and


for every store slice msos from msoss,
msos has not been overwritten, in the shared memory, by a
store that is from a hart that is not h, at any point since
msos was propagated to memory.


Action:


apply the actions of Commit store instruction; and


apply the action of Propagate store operation.
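The last condition is the reason the shared memory is kept as a list rather than an array: it needs the history of stores after the one the lr read from. A minimal sketch of that check for a single byte, assuming a hypothetical encoding of each propagated store as a `(hart id, address, data)` tuple held in propagation order:

```python
from typing import List, Tuple

StoreOp = Tuple[int, int, bytes]   # (hart id, address, data), assumed encoding


def overwritten_by_other_hart(stores: List[StoreOp], index: int,
                              byte_addr: int, hart: int) -> bool:
    """Has byte_addr of stores[index] been overwritten by a hart other than `hart`?"""
    for other_hart, addr, data in stores[index + 1:]:
        if other_hart != hart and addr <= byte_addr < addr + len(data):
            return True
    return False


memory = [
    (1, 0x100, bytes(4)),        # index 0: the store the lr on hart 0 read from
    (0, 0x100, b"\x01" * 4),     # index 1: same-hart store, may intervene
]
ok_before = not overwritten_by_other_hart(memory, 0, 0x100, hart=0)
memory.append((1, 0x100, b"\xff"))   # hart 1 now overwrites byte 0x100
ok_after = not overwritten_by_other_hart(memory, 0, 0x100, hart=0)
```

The intervening store from the sc's own hart does not block the sc, whereas the later store from another hart does.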


Late sc fail

An sc instruction instance i in state
Pending_mem_stores(store_continuation), that has not propagated its
memory store operation, can always be made to fail. Action:


clear i.mem_stores; and


update the state of i to Plain(store_continuation(false)).


For efficiency, the rmem tool allows this transition only when it is
not possible to take the Commit and propagate store operation of an sc
transition. This does not affect the set of allowed final states, but
when explored interactively, if the sc should fail one should use the
Early sc fail transition instead of waiting for this transition.

Complete store operations

A store instruction instance i in state
Pending_mem_stores(store_continuation), for which all the memory store
operations in i.mem_stores have been propagated, can always be
completed (not to be confused with finished). Action: update the state
of i to Plain(store_continuation(true)).

Satisfy, commit and propagate operations of an AMO

An AMO instruction instance i in state
Pending_mem_loads(load_continuation) can perform its memory access if
it is possible to perform the following sequence of transitions with no
intervening transitions:


Satisfy memory load operation from memory


Complete load operations


Pseudocode internal step (zero or more times)


Instantiate memory store operation values


Commit store instruction


Propagate store operation


Complete store operations


and in addition, the condition of Finish instruction, with the exception
of not requiring i to be in state Plain(Done), holds after those
transitions. Action: perform the above sequence of transitions (this
does not include Finish instruction), one after the other, with no
intervening transitions.

Notice that program-order-previous stores cannot be forwarded to the
load of an AMO. This is simply because the sequence of transitions above
does not include the forwarding transition. But even if it did include
it, the sequence would fail when trying to do the Propagate store
operation transition, as this transition requires all
program-order-previous store operations to overlapping memory footprints
to be propagated, and forwarding requires the store operation to be
unpropagated.

In addition, the store of an AMO cannot be forwarded to a
program-order-successor load. Before taking the transition above, the
store operation of the AMO does not have its value and therefore cannot
be forwarded; after taking the transition above the store operation is
propagated and therefore cannot be forwarded.

Commit fence

A fence instruction instance i in state Plain(Fence(kind,
next_state)) can be committed if:


if i is a normal fence and it has .pr set, all
program-order-previous load instructions are finished;


if i is a normal fence and it has .pw set, all
program-order-previous store instructions are finished; and


if i is a fence.tso, all program-order-previous load and store
instructions are finished.


Action:


record that i is committed; and


update the state of i to Plain(next_state).
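The commit condition for fences can be sketched as a predicate over the program-order-previous instructions. The `Instr` and `FenceInstr` records and the three-way `kind` classification are simplifying assumptions; the model's own instruction_kind is richer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instr:
    kind: str        # "load", "store" or "other" (simplified classification)
    finished: bool


@dataclass
class FenceInstr:
    tso: bool = False
    pr: bool = False   # predecessor-read bit (normal fences only)
    pw: bool = False   # predecessor-write bit (normal fences only)


def can_commit_fence(fence: FenceInstr, previous: List[Instr]) -> bool:
    """True if every previous instruction the fence orders is finished."""
    for instr in previous:
        if instr.finished:
            continue
        if fence.tso:
            if instr.kind in ("load", "store"):
                return False
        else:
            if fence.pr and instr.kind == "load":
                return False
            if fence.pw and instr.kind == "store":
                return False
    return True


previous = [Instr("load", finished=True), Instr("store", finished=False)]
```

With this `previous` list, a normal fence with only .pr set can commit, while one with .pw set, or a fence.tso, must wait for the unfinished store.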


Register read

An instruction instance i in state Plain(Read_reg(reg_name,
read_cont)) can do a register read of reg_name if every instruction
instance that it needs to read from has already performed the expected
reg_name register write.
Let read_sources include, for each bit of reg_name, the write to
that bit by the most recent (in program order) instruction instance that
can write to that bit, if any. If there is no such instruction, the
source is the initial register value from initial_register_state. Let
reg_value be the value assembled from read_sources. Action:


add reg_name to i.reg_reads with read_sources and
reg_value; and


update the state of i to Plain(read_cont(reg_value)).
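The per-bit selection of read_sources can be sketched as below. The tuple encoding of register write footprints and the `"initial"` marker are assumptions for illustration, and the sketch computes only the sources; it does not check the precondition that those instances have already performed their writes.

```python
from typing import List, Tuple

# (instance id, register, lowest bit, highest bit), oldest first in program order
WriteFootprint = Tuple[str, str, int, int]


def read_sources(reg: str, lo: int, hi: int,
                 previous_writes: List[WriteFootprint]) -> List[str]:
    """For each bit lo..hi of reg, name the most recent instance writing it."""
    sources = []
    for bit in range(lo, hi + 1):
        source = "initial"   # falls back to initial_register_state
        for instance, written_reg, wlo, whi in reversed(previous_writes):
            if written_reg == reg and wlo <= bit <= whi:
                source = instance
                break
        sources.append(source)
    return sources


writes = [("i1", "x1", 0, 63), ("i2", "x1", 0, 31)]
mixed = read_sources("x1", 30, 33, writes)
untouched = read_sources("x2", 0, 1, writes)
```

A read that straddles bit 31 of x1 takes its low bits from the more recent partial write and its high bits from the older full-width write.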


Register write

An instruction instance i in state Plain(Write_reg(reg_name,
reg_value, next_state)) can always do a reg_name register write.
Action:


add reg_name to i.reg_writes with deps and
reg_value; and


update the state of i to Plain(next_state).


where deps is a pair of the set of all read_sources from
i.reg_reads, and a flag that is true iff i is a load instruction
instance that has already been entirely satisfied.

Pseudocode internal step

An instruction instance i in state Plain(Internal(next_state)) can
always do that pseudocode-internal step. Action: update the state of i
to Plain(next_state).

Finish instruction

A non-finished instruction instance i in state Plain(Done) can be
finished if:


if i is a load instruction:


all program-order-previous load-acquire instructions are
finished;


all program-order-previous fence instructions with .sr set
are finished;


for every program-order-previous fence.tso instruction, f,
that is not finished, all load instructions that are
program-order-before f are finished; and


it is guaranteed that the values read by the memory load
operations of i will not cause coherence violations, i.e., for
any program-order-previous instruction instance i′, let cfp
be the combined footprint of propagated memory store operations
from store instructions program-order-between i and i′, and
fixed memory store operations that were forwarded to i from
store instructions program-order-between i and i′ including
i′, and let $\overline{\textit{cfp}}$ be the complement of
cfp in the memory footprint of i. If
$\overline{\textit{cfp}}$ is not empty:


i′ has a fully determined memory footprint;


i′ has no unpropagated memory store operations that
overlap with $\overline{\textit{cfp}}$; and


if i′ is a load with a memory footprint that overlaps with
$\overline{\textit{cfp}}$, then all the memory load
operations of i′ that overlap with
$\overline{\textit{cfp}}$ are satisfied and i′ is
non-restartable (see the Propagate store operation transition for
how to determine if an instruction is non-restartable).


Here, a memory store operation is called fixed if the store
instruction has fully determined data.


i has fully determined data; and


if i is not a fence, all program-order-previous conditional branch
and indirect jump instructions are finished.


Action:


if i is a conditional branch or indirect jump instruction, discard
any untaken paths of execution, i.e., remove all instruction
instances that are not reachable by the branch/jump taken in
instruction_tree; and


record the instruction as finished, i.e., set finished to true.
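The action of discarding untaken paths amounts to pruning the instruction tree below the branch. A minimal sketch, assuming a hypothetical `Instance` node type that keeps only the fields needed here:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instance:
    program_loc: int
    finished: bool = False
    successors: List["Instance"] = field(default_factory=list)


def finish_branch(branch: Instance, taken_loc: int) -> None:
    """Keep only the successor fetched from the taken address, then finish."""
    branch.successors = [s for s in branch.successors
                         if s.program_loc == taken_loc]
    branch.finished = True


# A branch at 0x100 with both the fall-through and the target fetched.
branch = Instance(0x100, successors=[Instance(0x104), Instance(0x80)])
finish_branch(branch, taken_loc=0x80)
```

Dropping a successor also drops its whole subtree, so every instance fetched speculatively down the untaken path disappears with it.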


Limitations


The model covers user-level RV64I and RV64A. In particular, it does
not support the misaligned atomics extension “Zam” or the total
store ordering extension “Ztso”. It should be trivial to adapt the
model to RV32I/A and to the G, Q and C extensions, but we have never
tried it. This will involve, mostly, writing Sail code for the
instructions, with minimal, if any, changes to the concurrency
model.


The model covers only normal memory accesses (it does not handle I/O
accesses).


The model does not cover TLB-related effects.


The model assumes the instruction memory is fixed. In particular,
the Fetch instruction transition does not generate memory load
operations, and the shared memory is not involved in the transition.
Instead, the model depends on an external oracle that provides an
opcode when given a memory location.


The model does not cover exceptions, traps and interrupts.


Contributors to all versions of the spec in alphabetical order (please
contact editors to suggest corrections): Krste Asanović, Peter Ashenden,
Rimas Avižienis, Jacob Bachmeyer, Allen J. Baum, Jonathan Behrens, Paolo
Bonzini, Ruslan Bukin, Christopher Celio, Chuanhua Chang, David
Chisnall, Anthony Coulter, Palmer Dabbelt, Monte Dalrymple, Paul
Donahue, Greg Favor, Dennis Ferguson, Marc Gauthier, Andy Glew, Gary
Guo, Mike Frysinger, John Hauser, David Horner, Olof Johansson, David
Kruckemyer, Yunsup Lee, Daniel Lustig, Andrew Lutomirski, Prashanth
Mundkur, Jonathan Neuschäfer, Rishiyur Nikhil, Stefan O’Rear, Albert Ou,
John Ousterhout, David Patterson, Dmitri Pavlov, Kade Phillips, Josh
Scheid, Colin Schmidt, Michael Taylor, Wesley Terpstra, Matt Thomas,
Tommy Thorn, Ray VanDeWalker, Megan Wachs, Steve Wallach, Andrew
Waterman, Claire Wolf, and Reinoud Zandijk.


This document is released under a Creative Commons Attribution 4.0
International License.

This document is a derivative of the RISC-V privileged specification
version 1.9.1 released under the following license: © 2010–2017 Andrew
Waterman, Yunsup Lee, Rimas Avižienis, David Patterson, Krste Asanović.
Creative Commons Attribution 4.0 International License.

Please cite as: “The RISC-V Instruction Set Manual, Volume II:
Privileged Architecture, Document Version 20211203”, Editors Andrew
Waterman, Krste Asanović, and John Hauser, RISC-V International,
December 2021.

Volume II: RISC-V Privileged Architectures V20211203
Preface

This document describes the RISC-V privileged architecture. This
release contains the following versions of the RISC-V ISA modules:


| Module | Version | Status |
|---|---|---|
| Machine ISA | 1.13 | Draft |
| Smrnmi Extension | 0.1 | Draft |
| Supervisor ISA | 1.12 | Ratified |
| Svnapot Extension | 1.0 | Ratified |
| Svpbmt Extension | 1.0 | Ratified |
| Svinval Extension | 1.0 | Ratified |
| Hypervisor ISA | 1.0 | Ratified |


The following compatible changes have been made to the Machine ISA since
version 1.12:

Defined the misa.V field to reflect that the V extension has been
implemented.

Preface to Version 20211203

This document describes the RISC-V privileged architecture. This
release, version 20211203, contains the following versions of the RISC-V
ISA modules:


| Module | Version | Status |
|---|---|---|
| Machine ISA | 1.12 | Ratified |
| Supervisor ISA | 1.12 | Ratified |
| Svnapot Extension | 1.0 | Ratified |
| Svpbmt Extension | 1.0 | Ratified |
| Svinval Extension | 1.0 | Ratified |
| Hypervisor ISA | 1.0 | Ratified |


The following changes have been made since version 1.11, which, while
not strictly backwards compatible, are not anticipated to cause software
portability problems in practice:


Changed MRET and SRET to clear mstatus.MPRV when leaving M-mode.


Reserved additional satp patterns for future use.


Stated that the scause Exception Code field must implement bits
4–0 at minimum.


Relaxed I/O regions have been specified to follow RVWMO. The
previous specification implied that PPO rules other than fences and
acquire/release annotations did not apply.


Constrained the LR/SC reservation set size and shape when using
page-based virtual memory.


PMP changes require an SFENCE.VMA on any hart that implements
page-based virtual memory, even if VM is not currently enabled.


Allowed for speculative updates of page table entry A bits.


Clarified that if the address-translation algorithm non-speculatively
reaches a PTE in which a bit reserved for future standard use is
set, a page-fault exception must be raised.


Additionally, the following compatible changes have been made since
version 1.11:


Removed the N extension.


Defined the mandatory RV32-only CSR mstatush, which contains most
of the same fields as the upper 32 bits of RV64’s mstatus.


Defined the mandatory CSR mconfigptr, which if nonzero contains
the address of a configuration data structure.


Defined optional mseccfg and mseccfgh CSRs, which control the
machine’s security configuration.


Defined menvcfg, henvcfg, and senvcfg CSRs (and RV32-only
menvcfgh and henvcfgh CSRs), which control various
characteristics of the execution environment.


Designated part of SYSTEM major opcode for custom use.


Permitted the unconditional delegation of less-privileged
interrupts.


Added optional big-endian and bi-endian support.


Made priority of load/store/AMO address-misaligned exceptions
implementation-defined relative to load/store/AMO page-fault and
access-fault exceptions.


PMP reset values are now platform-defined.


An additional 48 optional PMP registers have been defined.


Slightly relaxed the atomicity requirement for A and D bit updates
performed by the implementation.


Clarified the architectural behavior of address-translation caches.


Added Sv57 and Sv57x4 address translation modes.


Software breakpoint exceptions are permitted to write either 0 or
the pc to xtval.


Clarified that bare S-mode need not support the SFENCE.VMA
instruction.


Specified relaxed constraints for implicit reads of non-idempotent
regions.


Added the Svnapot Standard Extension, along with the N bit in Sv39,
Sv48, and Sv57 PTEs.


Added the Svpbmt Standard Extension, along with the PBMT bits in
Sv39, Sv48, and Sv57 PTEs.


Added the Svinval Standard Extension and associated instructions.


Finally, the hypervisor architecture proposal has been extensively
revised.
Preface to Version 1.11

This is version 1.11 of the RISC-V privileged architecture. The document
contains the following versions of the RISC-V ISA modules:


| Module | Version | Status |
|---|---|---|
| Machine ISA | 1.11 | Ratified |
| Supervisor ISA | 1.11 | Ratified |
| Hypervisor ISA | 0.3 | Draft |


Changes from version 1.10 include:


Moved Machine and Supervisor spec to Ratified status.


Improvements to the description and commentary.


Added a draft proposal for a hypervisor extension.


Specified which interrupt sources are reserved for standard use.


Allocated some synchronous exception causes for custom use.


Specified the priority ordering of synchronous exceptions.


Added specification that xRET instructions may, but are not required
to, clear LR reservations if A extension present.


The virtual-memory system no longer permits supervisor mode to
execute instructions from user pages, regardless of the SUM setting.


Clarified that ASIDs are private to a hart, and added commentary
about the possibility of a future global-ASID extension.


SFENCE.VMA semantics have been clarified.


Made the mstatus.MPP field WARL, rather than WLRL.


Made the unused xip fields WPRI, rather than WIRI.


Made the unused misa fields WLRL, rather than WIRI.


Made the unused pmpaddr and pmpcfg fields WARL, rather than WIRI.


Required all harts in a system to employ the same PTE-update scheme
as each other.


Rectified an editing error that misdescribed the mechanism by which
mstatus.xIE is written upon an exception.


Described scheme for emulating misaligned AMOs.


Specified the behavior of the misa and xepc registers in
systems with variable IALIGN.


Specified the behavior of writing self-contradictory values to the
misa register.


Defined the mcountinhibit CSR, which stops performance counters
from incrementing to reduce energy consumption.


Specified semantics for PMP regions coarser than four bytes.


Specified contents of CSRs across XLEN modification.


Moved PLIC chapter into its own document.


Preface to Version 1.10

This is version 1.10 of the RISC-V privileged architecture proposal.
Changes from version 1.9.1 include:


The previous version of this document was released under a Creative
Commons Attribution 4.0 International License by the original
authors, and this and future versions of this document will be
released under the same license.


The explicit convention on shadow CSR addresses has been removed to
reclaim CSR space. Shadow CSRs can still be added as needed.


The mvendorid register now contains the JEDEC code of the core
provider as opposed to a code supplied by the Foundation. This
avoids redundancy and offloads work from the Foundation.


The interrupt-enable stack discipline has been simplified.


An optional mechanism to change the base ISA used by supervisor and
user modes has been added to the mstatus CSR, and the field
previously called Base in misa has been renamed to MXL for
consistency.


Clarified expected use of XS to summarize additional extension state
status fields in mstatus.


Optional vectored interrupt support has been added to the mtvec
and stvec CSRs.


The SEIP and UEIP bits in the mip CSR have been redefined to
support software injection of external interrupts.


The mbadaddr register has been subsumed by a more general mtval
register that can now capture bad instruction bits on an illegal
instruction fault to speed instruction emulation.


The machine-mode base-and-bounds translation and protection schemes
have been removed from the specification as part of moving the
virtual memory configuration to sptbr (now satp). Some of the
motivation for the base and bound schemes are now covered by the PMP
registers, but space remains available in mstatus to add these
back at a later date if deemed useful.


In systems with only M-mode, or with both M-mode and U-mode but
without U-mode trap support, the medeleg and mideleg registers
now do not exist, whereas previously they returned zero.


Virtual-memory page faults now have mcause values distinct from
physical-memory access faults. Page-fault exceptions can now be
delegated to S-mode without delegating exceptions generated by PMA
and PMP checks.


An optional physical-memory protection (PMP) scheme has been
proposed.


The supervisor virtual memory configuration has been moved from the
mstatus register to the sptbr register. Accordingly, the sptbr
register has been renamed to satp (Supervisor Address Translation
and Protection) to reflect its broadened role.


The SFENCE.VM instruction has been removed in favor of the improved
SFENCE.VMA instruction.


The mstatus bit MXR has been exposed to S-mode via sstatus.


The polarity of the PUM bit in sstatus has been inverted to
shorten code sequences involving MXR. The bit has been renamed to
SUM.


Hardware management of page-table entry Accessed and Dirty bits has
been made optional; simpler implementations may trap to software to
set them.


The counter-enable scheme has changed, so that S-mode can control
availability of counters to U-mode.


H-mode has been removed, as we are focusing on recursive
virtualization support in S-mode. The encoding space has been
reserved and may be repurposed at a later date.


A mechanism to improve virtualization performance by trapping S-mode
virtual-memory management operations has been added.


The Supervisor Binary Interface (SBI) chapter has been removed, so
that it can be maintained as a separate specification.


Preface to Version 1.9.1

This is version 1.9.1 of the RISC-V privileged architecture proposal.
Changes from version 1.9 include:


Numerous additions and improvements to the commentary sections.


Changed the configuration string proposal to use a search process that
supports various formats, including Device Tree String and flattened
Device Tree.


Made misa optionally writable to support modifying base and
supported ISA extensions. CSR address of misa changed.


Added description of debug mode and debug CSRs.


Added a hardware performance monitoring scheme. Simplified the
handling of existing hardware counters, removing privileged versions
of the counters and the corresponding delta registers.


Fixed description of SPIE in presence of user-level interrupts.


Introduction

This document describes the RISC-V privileged architecture, which covers
all aspects of RISC-V systems beyond the unprivileged ISA, including
privileged instructions as well as additional functionality required for
running operating systems and attaching external devices.

Commentary on our design decisions is formatted as in this paragraph,
and can be skipped if the reader is only interested in the specification
itself.


We briefly note that the entire privileged-level design described in
this document could be replaced with an entirely different
privileged-level design without changing the unprivileged ISA, and
possibly without even changing the ABI. In particular, this privileged
specification was designed to run existing popular operating systems,
and so embodies the conventional level-based protection model. Alternate
privileged specifications could embody other more flexible
protection-domain models. For simplicity of expression, the text is
written as if this was the only possible privileged architecture.

RISC-V Privileged Software Stack Terminology

This section describes the terminology we use to describe components of
the wide range of possible privileged software stacks for RISC-V.
Figure 1.1 shows some of the possible
software stacks that can be supported by the RISC-V architecture. The
left-hand side shows a simple system that supports only a single
application running on an application execution environment (AEE). The
application is coded to run with a particular application binary
interface (ABI). The ABI includes the supported user-level ISA plus a
set of ABI calls to interact with the AEE. The ABI hides details of the
AEE from the application to allow greater flexibility in implementing
the AEE. The same ABI could be implemented natively on multiple
different host OSs, or could be supported by a user-mode emulation
environment running on a machine with a different native ISA.


Different implementation stacks
supporting various forms of privileged execution.


Our graphical convention represents abstract interfaces using black
boxes with white text, to separate them from concrete instances of
components implementing the interfaces.

The middle configuration shows a conventional operating system (OS) that
can support multiprogrammed execution of multiple applications. Each
application communicates over an ABI with the OS, which provides the
AEE. Just as applications interface with an AEE via an ABI, RISC-V
operating systems interface with a supervisor execution environment
(SEE) via a supervisor binary interface (SBI). An SBI comprises the
user-level and supervisor-level ISA together with a set of SBI function
calls. Using a single SBI across all SEE implementations allows a single
OS binary image to run on any SEE. The SEE can be a simple boot loader
and BIOS-style IO system in a low-end hardware platform, or a
hypervisor-provided virtual machine in a high-end server, or a thin
translation layer over a host operating system in an architecture
simulation environment.

Most supervisor-level ISA definitions do not separate the SBI from the
execution environment and/or the hardware platform, complicating
virtualization and bring-up of new hardware platforms.

The rightmost configuration shows a virtual machine monitor
configuration where multiple multiprogrammed OSs are supported by a
single hypervisor. Each OS communicates via an SBI with the hypervisor,
which provides the SEE. The hypervisor communicates with the hypervisor
execution environment (HEE) using a hypervisor binary interface (HBI),
to isolate the hypervisor from details of the hardware platform.

The ABI, SBI, and HBI are still a work-in-progress, but we are now
prioritizing support for Type-2 hypervisors where the SBI is provided
recursively by an S-mode OS.

Hardware implementations of the RISC-V ISA will generally require
additional features beyond the privileged ISA to support the various
execution environments (AEE, SEE, or HEE).
Privilege Levels

At any time, a RISC-V hardware thread (hart) is running at some
privilege level encoded as a mode in one or more CSRs (control and
status registers). Three RISC-V privilege levels are currently defined,
as shown in the table below.


| Level | Encoding | Name | Abbreviation |
|---|---|---|---|
| 0 | 00 | User/Application | U |
| 1 | 01 | Supervisor | S |
| 2 | 10 | Reserved | |
| 3 | 11 | Machine | M |


Privilege levels are used to provide protection between different
components of the software stack, and attempts to perform operations not
permitted by the current privilege mode will cause an exception to be
raised. These exceptions will normally cause traps into an underlying
execution environment.
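
As an illustrative sketch (not part of the specification), the encodings in the privilege-level table above can be modeled as a small Python enumeration. The names `PrivilegeLevel` and `decode_privilege` are hypothetical and chosen only for this example.

```python
from enum import IntEnum


class PrivilegeLevel(IntEnum):
    """Two-bit privilege-level encodings; the class name is illustrative."""
    U = 0b00  # User/Application
    S = 0b01  # Supervisor
    # 0b10 is reserved, so it deliberately has no member here
    M = 0b11  # Machine


def decode_privilege(encoding: int) -> PrivilegeLevel:
    """Map a 2-bit encoding to a privilege level.

    Raises ValueError for the reserved encoding 0b10 and for values
    outside the 2-bit range.
    """
    try:
        return PrivilegeLevel(encoding)
    except ValueError:
        raise ValueError(f"reserved or invalid privilege encoding: {encoding}") from None
```

Because the enumeration derives from `IntEnum`, levels compare numerically, so `PrivilegeLevel.M > PrivilegeLevel.S` holds, which matches the ordering of the encodings.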

In the description, we try to separate the privilege level for which
code is written, from the privilege mode in which it runs, although the
two are often tied. For example, a supervisor-level operating system can
run in supervisor-mode on a system with three privilege modes, but can
also run in user-mode under a classic virtual machine monitor on systems
with two or more privilege modes. In both cases, the same
supervisor-level operating system binary code can be used, coded to a
supervisor-level SBI and hence expecting to be able to use
supervisor-level privileged instructions and CSRs. When running a guest
OS in user mode, all supervisor-level actions will be trapped and
emulated by the SEE running in the higher-privilege level.

The machine level has the highest privileges and is the only mandatory
privilege level for a RISC-V hardware platform. Code run in machine-mode
(M-mode) is usually inherently trusted, as it has low-level access to
the machine implementation. M-mode can be used to manage secure
execution environments on RISC-V. User-mode (U-mode) and supervisor-mode
(S-mode) are intended for conventional application and operating system
usage respectively.
Each privilege level has a core set of privileged ISA extensions with
optional extensions and variants. For example, machine-mode supports an
optional standard extension for memory protection. Also, supervisor mode
can be extended to support Type-2 hypervisor execution as described in
the chapter on the hypervisor extension.
Implementations might provide anywhere from 1 to 3 privilege modes,
trading off reduced isolation for lower implementation cost, as shown in
the table below.


| Number of levels | Supported Modes | Intended Usage |
|---|---|---|
| 1 | M | Simple embedded systems |
| 2 | M, U | Secure embedded systems |
| 3 | M, S, U | Systems running Unix-like operating systems |


All hardware implementations must provide M-mode, as this is the only
mode that has unfettered access to the whole machine. The simplest
RISC-V implementations may provide only M-mode, though this will provide
no protection against incorrect or malicious application code.

The lock feature of the optional PMP facility can provide some limited
protection even with only M-mode implemented.

Many RISC-V implementations will also support at least user mode
(U-mode) to protect the rest of the system from application code.
Supervisor mode (S-mode) can be added to provide isolation between a
supervisor-level operating system and the SEE.
A hart normally runs application code in U-mode until some trap (e.g., a
supervisor call or a timer interrupt) forces a switch to a trap handler,
which usually runs in a more privileged mode. The hart will then execute
the trap handler, which will eventually resume execution at or after the
original trapped instruction in U-mode. Traps that increase privilege
level are termed vertical traps, while traps that remain at the same
privilege level are termed horizontal traps. The RISC-V privileged
architecture provides flexible routing of traps to different privilege
layers.

Horizontal traps can be implemented as vertical traps that return
control to a horizontal trap handler in the less-privileged mode.
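
The vertical/horizontal distinction can be expressed as a comparison of privilege levels. The following is a minimal sketch: the function name `classify_trap` is hypothetical, levels are given as the numeric encodings 0 (U), 1 (S), and 3 (M), and the rejection of a handler at a lower level than the trapping code is an assumption of this sketch rather than something stated in the passage above.

```python
def classify_trap(from_level: int, handler_level: int) -> str:
    """Classify a trap by comparing the privilege level of the trapping
    code with the level at which the trap handler runs."""
    if handler_level > from_level:
        return "vertical"    # the trap increases the privilege level
    if handler_level == from_level:
        return "horizontal"  # the trap stays at the same privilege level
    # Assumption of this sketch: a trap is never taken into a
    # less-privileged level than the code that trapped.
    raise ValueError("handler level is below the trapping level")
```

For example, a timer interrupt taken from U-mode into an M-mode handler would be `classify_trap(0, 3)`, a vertical trap.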

Debug Mode

Implementations may also include a debug mode to support off-chip
debugging and/or manufacturing test. Debug mode (D-mode) can be
considered an additional privilege mode, with even more access than
M-mode. The separate debug specification proposal describes operation of
a RISC-V hart in debug mode. Debug mode reserves a few CSR addresses
that are only accessible in D-mode, and may also reserve some portions
of the physical address space on a platform.
Control and Status Registers (CSRs)

The SYSTEM major opcode is used to encode all privileged instructions in
the RISC-V ISA. These can be divided into two main classes: those that
atomically read-modify-write control and status registers (CSRs), which
are defined in the Zicsr extension, and all other privileged
instructions. The privileged architecture requires the Zicsr extension;
which other privileged instructions are required depends on the
privileged-architecture feature set.
In addition to the unprivileged state described in Volume I of this
manual, an implementation may contain additional CSRs, accessible by
some subset of the privilege levels using the CSR instructions described
in Volume I. In this chapter, we map out the CSR address space. The
following chapters describe the function of each of the CSRs according
to privilege level, as well as the other privileged instructions which
are generally closely associated with a particular privilege level. Note
that although CSRs and instructions are associated with one privilege
level, they are also accessible at all higher privilege levels.
Standard CSRs do not have side effects on reads but may have side
effects on writes.
CSR Address Mapping Conventions

The standard RISC-V ISA sets aside a 12-bit encoding space (csr[11:0])
for up to 4,096 CSRs. By convention, the upper 4 bits of the CSR address
(csr[11:8]) are used to encode the read and write accessibility of the
CSRs according to privilege level, as shown in the table below. The top
two bits
(csr[11:10]) indicate whether the register is read/write (00, 01,
or 10) or read-only (11). The next two bits (csr[9:8]) encode the
lowest privilege level that can access the CSR.
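
The bit-field convention just described can be sketched as a small decoder. This is an illustration only; the function name `decode_csr_address` and the tuple it returns are choices made for this example, not anything the specification defines.

```python
def decode_csr_address(csr: int) -> tuple[bool, int]:
    """Split a 12-bit CSR address into (read_only, lowest_privilege).

    read_only is True when csr[11:10] == 0b11; any other value of those
    two bits means the register is read/write by convention.
    lowest_privilege is the 2-bit value of csr[9:8].
    """
    if not 0 <= csr <= 0xFFF:
        raise ValueError("CSR addresses are 12 bits wide")
    read_only = ((csr >> 10) & 0b11) == 0b11
    lowest_privilege = (csr >> 8) & 0b11
    return read_only, lowest_privilege
```

For instance, address 0xC00 decodes to a read-only CSR accessible from level 0, while 0x300 decodes to a read/write CSR that requires level 3.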

The CSR address convention uses the upper bits of the CSR address to
encode default access privileges. This simplifies error checking in the
hardware and provides a larger CSR space, but does constrain the mapping
of CSRs into the address space.
Implementations might allow a more-privileged level to trap otherwise
permitted CSR accesses by a less-privileged level to allow these
accesses to be intercepted. This change should be transparent to the
less-privileged software.


| CSR Address [11:10] | CSR Address [9:8] | CSR Address [7:4] | Hex | Use and Accessibility |
|---|---|---|---|---|
| Unprivileged and User-Level CSRs | | | | |
| 00 | 00 | XXXX | 0x000-0x0FF | Standard read/write |
| 01 | 00 | XXXX | 0x400-0x4FF | Standard read/write |
| 10 | 00 | XXXX | 0x800-0x8FF | Custom read/write |
| 11 | 00 | 0XXX | 0xC00-0xC7F | Standard read-only |
| 11 | 00 | 10XX | 0xC80-0xCBF | Standard read-only |
| 11 | 00 | 11XX | 0xCC0-0xCFF | Custom read-only |
| Supervisor-Level CSRs | | | | |
| 00 | 01 | XXXX | 0x100-0x1FF | Standard read/write |
| 01 | 01 | 0XXX | 0x500-0x57F | Standard read/write |
| 01 | 01 | 10XX | 0x580-0x5BF | Standard read/write |
| 01 | 01 | 11XX | 0x5C0-0x5FF | Custom read/write |
| 10 | 01 | 0XXX | 0x900-0x97F | Standard read/write |
| 10 | 01 | 10XX | 0x980-0x9BF | Standard read/write |
| 10 | 01 | 11XX | 0x9C0-0x9FF | Custom read/write |
| 11 | 01 | 0XXX | 0xD00-0xD7F | Standard read-only |
| 11 | 01 | 10XX | 0xD80-0xDBF | Standard read-only |
| 11 | 01 | 11XX | 0xDC0-0xDFF | Custom read-only |
| Hypervisor and VS CSRs | | | | |
| 00 | 10 | XXXX | 0x200-0x2FF | Standard read/write |
| 01 | 10 | 0XXX | 0x600-0x67F | Standard read/write |
| 01 | 10 | 10XX | 0x680-0x6BF | Standard read/write |
| 01 | 10 | 11XX | 0x6C0-0x6FF | Custom read/write |
| 10 | 10 | 0XXX | 0xA00-0xA7F | Standard read/write |
| 10 | 10 | 10XX | 0xA80-0xABF | Standard read/write |
| 10 | 10 | 11XX | 0xAC0-0xAFF | Custom read/write |
| 11 | 10 | 0XXX | 0xE00-0xE7F | Standard read-only |
| 11 | 10 | 10XX | 0xE80-0xEBF | Standard read-only |
| 11 | 10 | 11XX | 0xEC0-0xEFF | Custom read-only |
| Machine-Level CSRs | | | | |
| 00 | 11 | XXXX | 0x300-0x3FF | Standard read/write |
| 01 | 11 | 0XXX | 0x700-0x77F | Standard read/write |
| 01 | 11 | 100X | 0x780-0x79F | Standard read/write |
| 01 | 11 | 1010 | 0x7A0-0x7AF | Standard read/write debug CSRs |
| 01 | 11 | 1011 | 0x7B0-0x7BF | Debug-mode-only CSRs |
| 01 | 11 | 11XX | 0x7C0-0x7FF | Custom read/write |
| 10 | 11 | 0XXX | 0xB00-0xB7F | Standard read/write |
| 10 | 11 | 10XX | 0xB80-0xBBF | Standard read/write |
| 10 | 11 | 11XX | 0xBC0-0xBFF | Custom read/write |
| 11 | 11 | 0XXX | 0xF00-0xF7F | Standard read-only |
| 11 | 11 | 10XX | 0xF80-0xFBF | Standard read-only |
| 11 | 11 | 11XX | 0xFC0-0xFFF | Custom read-only |


Attempts to access a non-existent CSR raise an illegal instruction
exception. Attempts to access a CSR without appropriate privilege level
or to write a read-only register also raise illegal instruction
exceptions. A read/write register might also contain some bits that are
read-only, in which case writes to the read-only bits are ignored.
The CSR address table above also indicates the convention for allocating
CSR addresses between standard and custom uses. The CSR
addresses designated for custom uses will not be redefined by future
standard extensions.
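
As a rough illustration, the ranges that the CSR address table marks as custom can be collected into a lookup. The constant `CUSTOM_CSR_RANGES` and the function `is_custom_csr` are hypothetical names; the ranges themselves are copied from the table.

```python
# Inclusive (low, high) address ranges designated for custom use,
# transcribed from the CSR address-range table.
CUSTOM_CSR_RANGES = (
    (0x800, 0x8FF), (0xCC0, 0xCFF),                  # unprivileged and user-level
    (0x5C0, 0x5FF), (0x9C0, 0x9FF), (0xDC0, 0xDFF),  # supervisor-level
    (0x6C0, 0x6FF), (0xAC0, 0xAFF), (0xEC0, 0xEFF),  # hypervisor and VS
    (0x7C0, 0x7FF), (0xBC0, 0xBFF), (0xFC0, 0xFFF),  # machine-level
)


def is_custom_csr(csr: int) -> bool:
    """Return True if the 12-bit CSR address lies in a custom-use range."""
    if not 0 <= csr <= 0xFFF:
        raise ValueError("CSR addresses are 12 bits wide")
    return any(low <= csr <= high for low, high in CUSTOM_CSR_RANGES)
```

A table-driven check is used here because the custom ranges do not follow a single bit pattern: for example, 0x4C0 is in a standard range even though 0x5C0 is custom.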
Machine-mode standard read-write CSRs 0x7A0–0x7BF are reserved for
use by the debug system. Of these CSRs, 0x7A0–0x7AF are accessible
to machine mode, whereas 0x7B0–0x7BF are only visible to debug mode.
Implementations should raise illegal instruction exceptions on
machine-mode access to the latter set of registers.
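
The access rules in the preceding paragraphs can be summarized in a simplified check. This is a sketch under stated assumptions, not a complete model: the function name, its parameters, and the `implemented` set are hypothetical; privilege levels are the numeric encodings 0 (U), 1 (S), and 3 (M); the hypervisor extension and any further per-CSR conditions are ignored; and debug mode is assumed to bypass the privilege-level check.

```python
def csr_access_is_legal(csr: int, priv: int, is_write: bool,
                        implemented: set[int], debug_mode: bool = False) -> bool:
    """Return True if the access is permitted, or False if it would
    raise an illegal instruction exception under the simplified rules."""
    if csr not in implemented:
        return False                 # non-existent CSR
    if 0x7B0 <= csr <= 0x7BF:
        return debug_mode            # only visible to debug mode
    if not debug_mode and priv < ((csr >> 8) & 0b11):
        return False                 # insufficient privilege level
    if is_write and ((csr >> 10) & 0b11) == 0b11:
        return False                 # write to a read-only CSR
    return True
```

With `implemented = {0x300, 0xC00, 0xF11}`, a U-mode read of 0xC00 is legal, while a U-mode read of 0x300 or an M-mode write to 0xF11 is not.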

Effective virtualization requires that as many instructions run natively
as possible inside a virtualized environment, while any privileged
accesses trap to the virtual machine monitor. CSRs that are read-only
at some lower privilege level are shadowed into separate CSR addresses
if they are made read-write at a higher privilege level. This avoids
trapping permitted lower-privilege accesses while still causing traps on
illegal accesses. Currently, the counters are the only shadowed CSRs.

CSR Listing

Tables 1.1–1.5 list the CSRs that have currently
been allocated CSR addresses. The timers, counters, and floating-point
CSRs are standard unprivileged CSRs. The other registers are used by
privileged code, as described in the following chapters. Note that not
all registers are required on all implementations.


| Number | Privilege | Name | Description |
|---|---|---|---|
| Unprivileged Floating-Point CSRs | | | |
| 0x001 | URW | fflags | Floating-Point Accrued Exceptions. |
| 0x002 | URW | frm | Floating-Point Dynamic Rounding Mode. |
| 0x003 | URW | fcsr | Floating-Point Control and Status Register (frm + fflags). |
| Unprivileged Counter/Timers | | | |
| 0xC00 | URO | cycle | Cycle counter for RDCYCLE instruction. |
| 0xC01 | URO | time | Timer for RDTIME instruction. |
| 0xC02 | URO | instret | Instructions-retired counter for RDINSTRET instruction. |
| 0xC03 | URO | hpmcounter3 | Performance-monitoring counter. |
| 0xC04 | URO | hpmcounter4 | Performance-monitoring counter. |
| ⋮ | | | |
| 0xC1F | URO | hpmcounter31 | Performance-monitoring counter. |
| 0xC80 | URO | cycleh | Upper 32 bits of cycle, RV32 only. |
| 0xC81 | URO | timeh | Upper 32 bits of time, RV32 only. |
| 0xC82 | URO | instreth | Upper 32 bits of instret, RV32 only. |
| 0xC83 | URO | hpmcounter3h | Upper 32 bits of hpmcounter3, RV32 only. |
| 0xC84 | URO | hpmcounter4h | Upper 32 bits of hpmcounter4, RV32 only. |
| ⋮ | | | |
| 0xC9F | URO | hpmcounter31h | Upper 32 bits of hpmcounter31, RV32 only. |


Currently allocated RISC-V unprivileged CSR addresses.


| Number | Privilege | Name | Description |
|---|---|---|---|
| Supervisor Trap Setup | | | |
| 0x100 | SRW | sstatus | Supervisor status register. |
| 0x104 | SRW | sie | Supervisor interrupt-enable register. |
| 0x105 | SRW | stvec | Supervisor trap handler base address. |
| 0x106 | SRW | scounteren | Supervisor counter enable. |
| Supervisor Configuration | | | |
| 0x10A | SRW | senvcfg | Supervisor environment configuration register. |
| Supervisor Trap Handling | | | |
| 0x140 | SRW | sscratch | Scratch register for supervisor trap handlers. |
| 0x141 | SRW | sepc | Supervisor exception program counter. |
| 0x142 | SRW | scause | Supervisor trap cause. |
| 0x143 | SRW | stval | Supervisor bad address or instruction. |
| 0x144 | SRW | sip | Supervisor interrupt pending. |
| Supervisor Protection and Translation | | | |
| 0x180 | SRW | satp | Supervisor address translation and protection. |
| Debug/Trace Registers | | | |
| 0x5A8 | SRW | scontext | Supervisor-mode context register. |


Currently allocated RISC-V supervisor-level CSR addresses.


| Number | Privilege | Name | Description |
|---|---|---|---|
| Hypervisor Trap Setup | | | |
| 0x600 | HRW | hstatus | Hypervisor status register. |
| 0x602 | HRW | hedeleg | Hypervisor exception delegation register. |
| 0x603 | HRW | hideleg | Hypervisor interrupt delegation register. |
| 0x604 | HRW | hie | Hypervisor interrupt-enable register. |
| 0x606 | HRW | hcounteren | Hypervisor counter enable. |
| 0x607 | HRW | hgeie | Hypervisor guest external interrupt-enable register. |
| Hypervisor Trap Handling | | | |
| 0x643 | HRW | htval | Hypervisor bad guest physical address. |
| 0x644 | HRW | hip | Hypervisor interrupt pending. |
| 0x645 | HRW | hvip | Hypervisor virtual interrupt pending. |
| 0x64A | HRW | htinst | Hypervisor trap instruction (transformed). |
| 0xE12 | HRO | hgeip | Hypervisor guest external interrupt pending. |
| Hypervisor Configuration | | | |
| 0x60A | HRW | henvcfg | Hypervisor environment configuration register. |
| 0x61A | HRW | henvcfgh | Additional hypervisor env. conf. register, RV32 only. |
| Hypervisor Protection and Translation | | | |
| 0x680 | HRW | hgatp | Hypervisor guest address translation and protection. |
| Debug/Trace Registers | | | |
| 0x6A8 | HRW | hcontext | Hypervisor-mode context register. |
| Hypervisor Counter/Timer Virtualization Registers | | | |
| 0x605 | HRW | htimedelta | Delta for VS/VU-mode timer. |
| 0x615 | HRW | htimedeltah | Upper 32 bits of htimedelta, HSXLEN=32 only. |
| Virtual Supervisor Registers | | | |
| 0x200 | HRW | vsstatus | Virtual supervisor status register. |
| 0x204 | HRW | vsie | Virtual supervisor interrupt-enable register. |
| 0x205 | HRW | vstvec | Virtual supervisor trap handler base address. |
| 0x240 | HRW | vsscratch | Virtual supervisor scratch register. |
| 0x241 | HRW | vsepc | Virtual supervisor exception program counter. |
| 0x242 | HRW | vscause | Virtual supervisor trap cause. |
| 0x243 | HRW | vstval | Virtual supervisor bad address or instruction. |
| 0x244 | HRW | vsip | Virtual supervisor interrupt pending. |
| 0x280 | HRW | vsatp | Virtual supervisor address translation and protection. |


Currently allocated RISC-V hypervisor and VS CSR addresses.


| Number | Privilege | Name | Description |
|---|---|---|---|
| Machine Information Registers | | | |
| 0xF11 | MRO | mvendorid | Vendor ID. |
| 0xF12 | MRO | marchid | Architecture ID. |
| 0xF13 | MRO | mimpid | Implementation ID. |
| 0xF14 | MRO | mhartid | Hardware thread ID. |
| 0xF15 | MRO | mconfigptr | Pointer to configuration data structure. |
| Machine Trap Setup | | | |
| 0x300 | MRW | mstatus | Machine status register. |
| 0x301 | MRW | misa | ISA and extensions. |
| 0x302 | MRW | medeleg | Machine exception delegation register. |
| 0x303 | MRW | mideleg | Machine interrupt delegation register. |
| 0x304 | MRW | mie | Machine interrupt-enable register. |
| 0x305 | MRW | mtvec | Machine trap-handler base address. |
| 0x306 | MRW | mcounteren | Machine counter enable. |
| 0x310 | MRW | mstatush | Additional machine status register, RV32 only. |
| Machine Trap Handling | | | |
| 0x340 | MRW | mscratch | Scratch register for machine trap handlers. |
| 0x341 | MRW | mepc | Machine exception program counter. |
| 0x342 | MRW | mcause | Machine trap cause. |
| 0x343 | MRW | mtval | Machine bad address or instruction. |
| 0x344 | MRW | mip | Machine interrupt pending. |
| 0x34A | MRW | mtinst | Machine trap instruction (transformed). |
| 0x34B | MRW | mtval2 | Machine bad guest physical address. |
| Machine Configuration | | | |
| 0x30A | MRW | menvcfg | Machine environment configuration register. |
| 0x31A | MRW | menvcfgh | Additional machine env. conf. register, RV32 only. |
| 0x747 | MRW | mseccfg | Machine security configuration register. |
| 0x757 | MRW | mseccfgh | Additional machine security conf. register, RV32 only. |
| Machine Memory Protection | | | |
| 0x3A0 | MRW | pmpcfg0 | Physical memory protection configuration. |
| 0x3A1 | MRW | pmpcfg1 | Physical memory protection configuration, RV32 only. |
| 0x3A2 | MRW | pmpcfg2 | Physical memory protection configuration. |
| 0x3A3 | MRW | pmpcfg3 | Physical memory protection configuration, RV32 only. |
| ⋮ | | | |
| 0x3AE | MRW | pmpcfg14 | Physical memory protection configuration. |
| 0x3AF | MRW | pmpcfg15 | Physical memory protection configuration, RV32 only. |
| 0x3B0 | MRW | pmpaddr0 | Physical memory protection address register. |
| 0x3B1 | MRW | pmpaddr1 | Physical memory protection address register. |
| ⋮ | | | |
| 0x3EF | MRW | pmpaddr63 | Physical memory protection address register. |


Currently allocated RISC-V machine-level CSR addresses.


| Number | Privilege | Name | Description |
|---|---|---|---|
| Machine Non-Maskable Interrupt Handling | | | |
| 0x740 | MRW | mnscratch | Resumable NMI scratch register. |
| 0x741 | MRW | mnepc | Resumable NMI program counter. |
| 0x742 | MRW | mncause | Resumable NMI cause. |
| 0x744 | MRW | mnstatus | Resumable NMI status. |
| Machine Counter/Timers | | | |
| 0xB00 | MRW | mcycle | Machine cycle counter. |
| 0xB02 | MRW | minstret | Machine instructions-retired counter. |
| 0xB03 | MRW | mhpmcounter3 | Machine performance-monitoring counter. |
| 0xB04 | MRW | mhpmcounter4 | Machine performance-monitoring counter. |
| ⋮ | | | |
| 0xB1F | MRW | mhpmcounter31 | Machine performance-monitoring counter. |
| 0xB80 | MRW | mcycleh | Upper 32 bits of mcycle, RV32 only. |
| 0xB82 | MRW | minstreth | Upper 32 bits of minstret, RV32 only. |
| 0xB83 | MRW | mhpmcounter3h | Upper 32 bits of mhpmcounter3, RV32 only. |
| 0xB84 | MRW | mhpmcounter4h | Upper 32 bits of mhpmcounter4, RV32 only. |
| ⋮ | | | |
| 0xB9F | MRW | mhpmcounter31h | Upper 32 bits of mhpmcounter31, RV32 only. |
| Machine Counter Setup | | | |
| 0x320 | MRW | mcountinhibit | Machine counter-inhibit register. |
| 0x323 | MRW | mhpmevent3 | Machine performance-monitoring event selector. |
| 0x324 | MRW | mhpmevent4 | Machine performance-monitoring event selector. |
| ⋮ | | | |
| 0x33F | MRW | mhpmevent31 | Machine performance-monitoring event selector. |
| Debug/Trace Registers (shared with Debug Mode) | | | |
| 0x7A0 | MRW | tselect | Debug/Trace trigger register select. |
| 0x7A1 | MRW | tdata1 | First Debug/Trace trigger data register. |
| 0x7A2 | MRW | tdata2 | Second Debug/Trace trigger data register. |
| 0x7A3 | MRW | tdata3 | Third Debug/Trace trigger data register. |
| 0x7A8 | MRW | mcontext | Machine-mode context register. |
| Debug Mode Registers | | | |
| 0x7B0 | DRW | dcsr | Debug control and status register. |
| 0x7B1 | DRW | dpc | Debug program counter. |
| 0x7B2 | DRW | dscratch0 | Debug scratch register 0. |
| 0x7B3 | DRW | dscratch1 | Debug scratch register 1. |


Currently allocated RISC-V machine-level CSR addresses.


CSR Field Specifications

The following definitions and abbreviations are used in specifying the
behavior of fields within the CSRs.
Reserved Writes Preserve Values, Reads Ignore Values (WPRI)

Some whole read/write fields are reserved for future use. Software
should ignore the values read from these fields, and should preserve the
values held in these fields when writing values to other fields of the
same register. For forward compatibility, implementations that do not
furnish these fields must make them read-only zero. These fields are
labeled WPRI in the register descriptions.

To simplify the software model, any backward-compatible future
definition of previously reserved fields within a CSR must cope with the
possibility that a non-atomic read/modify/write sequence is used to
update other fields in the CSR. Alternatively, the original CSR
definition must specify that subfields can only be updated atomically,
which may require a two-instruction clear bit/set bit sequence in
general that can be problematic if intermediate values are not legal.
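
The preserve-on-write rule can be sketched as a read/modify/write helper. This is a minimal software illustration, not a CSR access sequence; `update_field` is a hypothetical name, and the example mask 0x1800 assumes the MPP field at bits 12:11 of mstatus.

```python
def update_field(old_csr, field_mask, field_value):
    """Return old_csr with only the bits selected by field_mask replaced.

    All other bits, including any WPRI fields, keep the value that was read.
    """
    shift = (field_mask & -field_mask).bit_length() - 1  # index of lowest mask bit
    return (old_csr & ~field_mask) | ((field_value << shift) & field_mask)


# Set MPP (bits 12:11) to 0b01 while leaving every other bit untouched.
new_value = update_field(0x0000_1888, 0x1800, 0b01)
print(hex(new_value))  # 0x888
```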

Write/Read Only Legal Values (WLRL)

Some read/write CSR fields specify behavior for only a subset of
possible bit encodings, with other bit encodings reserved. Software
should not write anything other than legal values to such a field, and
should not assume a read will return a legal value unless the last write
was of a legal value, or the register has not been written since another
operation (e.g., reset) set the register to a legal value. These fields
are labeled WLRL in the register descriptions.

Hardware implementations need only implement enough state bits to
differentiate between the supported values, but must always return the
complete specified bit-encoding of any supported value when read.

Implementations are permitted but not required to raise an illegal
instruction exception if an instruction attempts to write a
non-supported value to a WLRL field. Implementations can return arbitrary
bit patterns on the read of a WLRL field when the last write was of an
illegal value, but the value returned should deterministically depend on
the illegal written value and the value of the field prior to the write.
Write Any Values, Reads Legal Values (WARL)

Some read/write CSR fields are only defined for a subset of bit
encodings, but allow any value to be written while guaranteeing to
return a legal value whenever read. Assuming that writing the CSR has no
other side effects, the range of supported values can be determined by
attempting to write a desired setting then reading to see if the value
was retained. These fields are labeled WARL in the register descriptions.
Implementations will not raise an exception on writes of unsupported
values to a WARL field. Implementations can return any legal value on the
read of a WARL field when the last write was of an illegal value, but the
legal value returned should deterministically depend on the illegal
written value and the architectural state of the hart.
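
The write-then-read probing described above can be sketched with a small software model. The `WarlField` class and `probe_supported` helper are hypothetical stand-ins for a hardware field. Keeping the previous value on an illegal write is only one of the deterministic behaviors an implementation might choose.

```python
class WarlField:
    """Toy model of a WARL field: any value may be written, reads are always legal."""

    def __init__(self, legal, reset):
        assert reset in legal
        self.legal = frozenset(legal)
        self.value = reset

    def write(self, v):
        # Illegal writes never trap; this model simply keeps the old legal value.
        if v in self.legal:
            self.value = v

    def read(self):
        return self.value


def probe_supported(field, candidates):
    """Return the candidate values that the field retains after being written."""
    original = field.read()
    supported = []
    for v in candidates:
        field.write(v)
        if field.read() == v:
            supported.append(v)
    field.write(original)  # restore the starting (legal) value
    return supported


# Example: a two-bit field whose only legal encodings are 0b00 and 0b11.
field = WarlField(legal={0b00, 0b11}, reset=0b11)
print(probe_supported(field, range(4)))  # [0, 3]
```

This only works when, as the text assumes, writing the CSR has no other side effects.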
CSR Field Modulation

If a write to one CSR changes the set of legal values allowed for a
field of a second CSR, then unless specified otherwise, the second CSR’s
field immediately gets an UNSPECIFIED value from among its new legal values. This
is true even if the field’s value before the write remains legal after
the write; the value of the field may be changed in consequence of the
write to the controlling CSR.

As a special case of this rule, the value written to one CSR may control
whether a field of a second CSR is writable (with multiple legal values)
or is read-only. When a write to the controlling CSR causes the second
CSR’s field to change from previously read-only to now writable, that
field immediately gets an UNSPECIFIED but legal value, unless specified otherwise.


Some CSR fields are, when writable, defined as aliases of other CSR
fields. Let x be such a CSR field, and let y be the CSR field it
aliases when writable. If a write to a controlling CSR causes field x
to change from previously read-only to now writable, the new value of
x is not UNSPECIFIED but instead immediately reflects the existing value of its
alias y, as required.

A change to the value of a CSR for this reason is not a write to the
affected CSR and thus does not trigger any side effects specified for
that CSR.
Implicit Reads of CSRs

Implementations sometimes perform implicit reads of CSRs. (For
example, all S-mode instruction fetches implicitly read the satp CSR.)
Unless otherwise specified, the value returned by an implicit read of a
CSR is the same value that would have been returned by an explicit read
of the CSR, using a CSR-access instruction in a sufficient privilege
mode.
CSR Width Modulation

If the width of a CSR is changed (for example, by changing MXLEN or
UXLEN, as described in
Section [xlen-control]), the values of the
writable fields and bits of the new-width CSR are, unless specified
otherwise, determined from the previous-width CSR as though by this
algorithm:


The value of the previous-width CSR is copied to a temporary
register of the same width.


For the read-only bits of the previous-width CSR, the bits at the
same positions in the temporary register are set to zeros.


The width of the temporary register is changed to the new width. If
the new width W is narrower than the previous width, the
least-significant W bits of the temporary register are retained
and the more-significant bits are discarded. If the new width is
wider than the previous width, the temporary register is
zero-extended to the wider width.


Each writable field of the new-width CSR takes the value of the bits
at the same positions in the temporary register.


Changing the width of a CSR is not a read or write of the CSR and thus
does not trigger any side effects.
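
The four steps of the width-change algorithm can be sketched as a pure function. The name `change_csr_width` and the two mask parameters are assumptions made for illustration. The function returns only the values of the writable fields at the new width; read-only bits of the new-width CSR are outside its scope.

```python
def change_csr_width(old_value, old_width, new_width,
                     old_readonly_mask, new_writable_mask):
    # Step 1: copy the previous-width CSR to a temporary of the same width.
    tmp = old_value & ((1 << old_width) - 1)
    # Step 2: zero the positions that were read-only at the previous width.
    tmp &= ~old_readonly_mask
    # Step 3: keep the least-significant new_width bits. When widening, the
    # temporary is non-negative, so the upper bits are already zero.
    tmp &= (1 << new_width) - 1
    # Step 4: each writable field takes the bits at the same positions.
    return tmp & new_writable_mask


# Narrowing from 64 to 32 bits; bits 3:0 were read-only at the old width.
print(hex(change_csr_width(0xFFFF_FFFF_0000_00FF, 64, 32, 0xF, 0xFFFF_FFFF)))  # 0xf0
```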
Machine-Level ISA, Version 1.12

This chapter describes the machine-level operations available in
machine-mode (M-mode), which is the highest privilege mode in a RISC-V
system. M-mode is used for low-level access to a hardware platform and
is the first mode entered at reset. M-mode can also be used to implement
features that are too difficult or expensive to implement in hardware
directly. The RISC-V machine-level ISA contains a common core that is
extended depending on which other privilege levels are supported and
other details of the hardware implementation.
Machine-Level CSRs

In addition to the machine-level CSRs described in this section, M-mode
code can access all CSRs at lower privilege levels.
Machine ISA Register misa

The misa CSR is a WARL read-write register reporting the ISA supported by
the hart. This register must be readable in any implementation, but a
value of zero can be returned to indicate the misa register has not
been implemented, requiring that CPU capabilities be determined through
a separate non-standard mechanism.


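Figure: Machine ISA register (misa). Fields, from the most-significant bit down: MXL[1:0] (WARL, 2 bits), 0 (WARL, MXLEN-28 bits), Extensions[25:0] (WARL, 26 bits).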


The MXL (Machine XLEN) field encodes the native base integer ISA width
as shown in Table [misabase]. The MXL field may be writable
in implementations that support multiple base ISAs. The effective XLEN
in M-mode, MXLEN, is given by the setting of MXL, or has a fixed value
if misa is zero. The MXL field is always set to the widest supported
ISA variant at reset.


| MXL | XLEN |
|-----|------|
| 1   | 32   |
| 2   | 64   |
| 3   | 128  |


The misa CSR is MXLEN bits wide. If the value read from misa is
nonzero, field MXL of that value always denotes the current MXLEN. If a
write to misa causes MXLEN to change, the position of MXL moves to the
most-significant two bits of misa at the new width.

The base width can be quickly ascertained using branches on the sign of
the returned misa value, and possibly a shift left by one and a second
branch on the sign. These checks can be written in assembly code without
knowing the register width (XLEN) of the machine. The base width is
given by XLEN = 2^(MXL+4).
The base width can also be found if misa is zero, by placing the
immediate 4 in a register then shifting the register left by 31 bits at
a time. If zero after one shift, then the machine is RV32. If zero after
two shifts, then the machine is RV64, else RV128.
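
The three width checks above can be modeled in software as follows. This is a sketch that simulates fixed-width registers with Python integers; the function names and the `reg_width` parameter are assumptions, and real code would use branches and shifts in assembly.

```python
def xlen_from_mxl(mxl):
    """XLEN = 2^(MXL+4) for the defined encodings MXL = 1, 2, 3."""
    if mxl not in (1, 2, 3):
        raise ValueError("reserved MXL encoding")
    return 1 << (mxl + 4)


def xlen_by_sign(misa, reg_width):
    """Sign-test method for a nonzero misa held in a reg_width-bit register."""
    msb = 1 << (reg_width - 1)
    if not (misa & msb):          # non-negative: MXL = 1
        return 32
    if not ((misa << 1) & msb):   # non-negative after one left shift: MXL = 2
        return 64
    return 128


def xlen_by_shifting(reg_width):
    """Fallback for misa == 0: shift the constant 4 left by 31 bits at a time."""
    mask = (1 << reg_width) - 1
    r = (4 << 31) & mask
    if r == 0:
        return 32
    r = (r << 31) & mask
    if r == 0:
        return 64
    return 128


print([xlen_by_shifting(w) for w in (32, 64, 128)])  # [32, 64, 128]
```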

The Extensions field encodes the presence of the standard extensions,
with a single bit per letter of the alphabet (bit 0 encodes presence of
extension “A”, bit 1 encodes presence of extension “B”, through to bit
25 which encodes “Z”). The “I” bit will be set for RV32I, RV64I, RV128I
base ISAs, and the “E” bit will be set for RV32E. The Extensions field
is a WARL field that can contain writable bits where the implementation
allows the supported ISA to be modified. At reset, the Extensions field
shall contain the maximal set of supported extensions, and I shall be
selected over E if both are available.
When a standard extension is disabled by clearing its bit in misa, the
instructions and CSRs defined or modified by the extension revert to
their defined or reserved behaviors as if the extension is not
implemented.

For a given RISC-V execution environment, an instruction, extension, or
other feature of the RISC-V ISA is ordinarily judged to be implemented
or not by the observable execution behavior in that environment. For
example, the F extension is said to be implemented for an execution
environment if and only if the instructions that the RISC-V Unprivileged
ISA defines for F execute as specified.
With this definition of implemented, disabling an extension by
clearing its bit in misa results in the extension being considered
not implemented in M-mode. For example, setting misa.F=0 results in
the F extension being not implemented for M-mode, because the F
extension’s instructions will not act as the Unprivileged ISA requires
but may instead raise an illegal instruction exception.
Defining the term implemented based strictly on the observable
behavior might conflict with other common understandings of the same
word. In particular, although common usage may allow for the combination
“implemented but disabled,” in this document it is considered a
contradiction of terms, because disabled implies execution will not
behave as required for the feature to be considered implemented. In
the same vein, “implemented and enabled” is redundant here;
“implemented” suffices.

The design of the RV128I base ISA is not yet complete, and while much of
the remainder of this specification is expected to apply to RV128, this
version of the document focuses only on RV32 and RV64.
The “U” and “S” bits will be set if there is support for user and
supervisor modes respectively.
The “X” bit will be set if there are any non-standard extensions.


| Bit | Character | Description |
|-----|-----------|-------------|
| 0   | A | Atomic extension |
| 1   | B | Reserved |
| 2   | C | Compressed extension |
| 3   | D | Double-precision floating-point extension |
| 4   | E | RV32E base ISA |
| 5   | F | Single-precision floating-point extension |
| 6   | G | Reserved |
| 7   | H | Hypervisor extension |
| 8   | I | RV32I/64I/128I base ISA |
| 9   | J | Reserved |
| 10  | K | Reserved |
| 11  | L | Reserved |
| 12  | M | Integer Multiply/Divide extension |
| 13  | N | Tentatively reserved for User-Level Interrupts extension |
| 14  | O | Reserved |
| 15  | P | Tentatively reserved for Packed-SIMD extension |
| 16  | Q | Quad-precision floating-point extension |
| 17  | R | Reserved |
| 18  | S | Supervisor mode implemented |
| 19  | T | Reserved |
| 20  | U | User mode implemented |
| 21  | V | “V” Vector extension implemented |
| 22  | W | Reserved |
| 23  | X | Non-standard extensions present |
| 24  | Y | Reserved |
| 25  | Z | Reserved |
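
As an illustration of the one-bit-per-letter encoding, the sketch below turns the low 26 bits of a misa value into a string of extension letters. The function name `misa_extensions` and the sample value are assumptions chosen for the example.

```python
import string


def misa_extensions(misa):
    """Return the letters whose bits are set in the Extensions field (bits 25:0)."""
    ext_bits = misa & ((1 << 26) - 1)
    return "".join(
        letter
        for bit, letter in enumerate(string.ascii_uppercase)
        if (ext_bits >> bit) & 1
    )


# MXL = 2 in the top two bits of a 64-bit misa, with A, C, D, F, I, M, S, U set.
sample = (2 << 62) | 0x14112D
print(misa_extensions(sample))  # ACDFIMSU
```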


The misa CSR exposes a rudimentary catalog of CPU features to
machine-mode code. More extensive information can be obtained in machine
mode by probing other machine registers, and examining other ROM storage
in the system as part of the boot process.
We require that lower privilege levels execute environment calls instead
of reading CPU registers to determine features available at each
privilege level. This enables virtualization layers to alter the ISA
observed at any level, and supports a much richer command interface
without burdening hardware designs.

The “E” bit is read-only. Unless misa is all read-only zero, the “E”
bit always reads as the complement of the “I” bit. If an execution
environment supports both RV32E and RV32I, software can select RV32E by
clearing the “I” bit.
If an ISA feature x depends on an ISA feature y, then attempting to
enable feature x but disable feature y results in both features
being disabled. For example, setting “F”=0 and “D”=1 results in both “F”
and “D” being cleared.
An implementation may impose additional constraints on the collective
setting of two or more misa fields, in which case they function
collectively as a single WARL field. An attempt to write an unsupported
combination causes those bits to be set to some supported combination.
Writing misa may increase IALIGN, e.g., by disabling the “C”
extension. If an instruction that would write misa increases IALIGN,
and the subsequent instruction’s address is not IALIGN-bit aligned, the
write to misa is suppressed, leaving misa unchanged.
When software enables an extension that was previously disabled, then
all state uniquely associated with that extension is UNSPECIFIED, unless otherwise
specified by that extension.
Machine Vendor ID Register mvendorid

The mvendorid CSR is a 32-bit read-only register providing the JEDEC
manufacturer ID of the provider of the core. This register must be
readable in any implementation, but a value of 0 can be returned to
indicate the field is not implemented or that this is a non-commercial
implementation.


Figure: Vendor ID register (mvendorid). Fields: Bank (bits 31:7, 25 bits), Offset (bits 6:0, 7 bits).


JEDEC manufacturer IDs are ordinarily encoded as a sequence of one-byte
continuation codes 0x7f, terminated by a one-byte ID not equal to
0x7f, with an odd parity bit in the most-significant bit of each byte.
mvendorid encodes the number of one-byte continuation codes in the
Bank field, and encodes the final byte in the Offset field, discarding
the parity bit. For example, the JEDEC manufacturer ID
0x7f 0x7f 0x7f 0x7f 0x7f 0x7f 0x7f 0x7f 0x7f 0x7f 0x7f 0x7f 0x8a
(twelve continuation codes followed by 0x8a) would be encoded in the
mvendorid CSR as 0x60a.

In JEDEC’s parlance, the bank number is one greater than the number of
continuation codes; hence, the mvendorid Bank field encodes a value
that is one less than the JEDEC bank number.
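
The Bank/Offset packing can be checked with a short sketch. The helper name `encode_mvendorid` is hypothetical; it assumes the input is the raw JEDEC byte sequence, including parity bits, and it does not validate parity.

```python
def encode_mvendorid(jedec_bytes):
    """Pack a JEDEC manufacturer ID into Bank (bits 31:7) and Offset (bits 6:0)."""
    *continuations, final = jedec_bytes
    if any(b != 0x7F for b in continuations) or final == 0x7F:
        raise ValueError("malformed JEDEC manufacturer ID")
    bank = len(continuations)  # number of 0x7f continuation codes
    offset = final & 0x7F      # final byte with the parity bit discarded
    return (bank << 7) | offset


# Twelve continuation codes followed by 0x8a, as in the example above.
print(hex(encode_mvendorid([0x7F] * 12 + [0x8A])))  # 0x60a
```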


Previously the vendor ID was to be a number allocated by RISC-V
International, but this duplicates the work of JEDEC in maintaining a
manufacturer ID standard. At time of writing, registering a manufacturer
ID with JEDEC has a one-time cost of $500.

Machine Architecture ID Register marchid

The marchid CSR is an MXLEN-bit read-only register encoding the base
microarchitecture of the hart. This register must be readable in any
implementation, but a value of 0 can be returned to indicate the field
is not implemented. The combination of mvendorid and marchid should
uniquely identify the type of hart microarchitecture that is
implemented.


Figure: Machine Architecture ID register (marchid). A single Architecture ID field, MXLEN bits wide.


Open-source project architecture IDs are allocated globally by RISC-V
International, and have non-zero architecture IDs with a zero
most-significant-bit (MSB). Commercial architecture IDs are allocated by
each commercial vendor independently, but must have the MSB set and
cannot contain zero in the remaining MXLEN-1 bits.
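
The allocation rule can be expressed as a small classifier. This is a sketch with hypothetical names and return strings; it only inspects the bit pattern and says nothing about whether an ID was actually allocated.

```python
def classify_marchid(marchid, mxlen):
    """Classify a marchid value by its most-significant bit."""
    msb = 1 << (mxlen - 1)
    if marchid == 0:
        return "not implemented"
    if (marchid & msb) == 0:
        return "open-source"   # nonzero ID with a zero MSB
    if (marchid & (msb - 1)) == 0:
        return "invalid"       # MSB set, but the remaining MXLEN-1 bits are zero
    return "commercial"


print(classify_marchid(5, 64))              # open-source
print(classify_marchid((1 << 63) | 1, 64))  # commercial
```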

The intent is for the architecture ID to represent the microarchitecture
associated with the repo around which development occurs rather than a
particular organization. Commercial fabrications of open-source designs
should (and might be required by the license to) retain the original
architecture ID. This will aid in reducing fragmentation and tool
support costs, as well as provide attribution. Open-source architecture
IDs are administered by RISC-V International and should only be
allocated to released, functioning open-source projects. Commercial
architecture IDs can be managed independently by any registered vendor
but are required to have IDs disjoint from the open-source architecture
IDs (MSB set) to prevent collisions if a vendor wishes to use both
closed-source and open-source microarchitectures.
The convention adopted within the following Implementation field can be
used to segregate branches of the same architecture design, including by
organization. The misa register also helps distinguish different
variants of a design.

Machine Implementation ID Register mimpid

The mimpid CSR provides a unique encoding of the version of the
processor implementation. This register must be readable in any
implementation, but a value of 0 can be returned to indicate that the
field is not implemented. The Implementation value should reflect the
design of the RISC-V processor itself and not any surrounding system.


Figure: Machine Implementation ID register (mimpid). A single Implementation field, MXLEN bits wide.


The format of this field is left to the provider of the architecture
source code, but will often be printed by standard tools as a
hexadecimal string without any leading or trailing zeros, so the
Implementation value can be left-justified (i.e., filled in from
most-significant nibble down) with subfields aligned on nibble
boundaries to ease human readability.

Hart ID Register mhartid

The mhartid CSR is an MXLEN-bit read-only register containing the
integer ID of the hardware thread running the code. This register must
be readable in any implementation. Hart IDs might not necessarily be
numbered contiguously in a multiprocessor system, but at least one hart
must have a hart ID of zero. Hart IDs must be unique within the
execution environment.


Figure: Hart ID register (mhartid). A single Hart ID field, MXLEN bits wide.


In certain cases, we must ensure exactly one hart runs some code (e.g.,
at reset), and so require one hart to have a known hart ID of zero.
For efficiency, system implementers should aim to reduce the magnitude
of the largest hart ID used in a system.

Machine Status Registers (mstatus and mstatush)

The mstatus register is an MXLEN-bit read/write register formatted as
shown in Figure [mstatusreg-rv32] for RV32 and
Figure [mstatusreg] for RV64. The mstatus
register keeps track of and controls the hart’s current operating state.
A restricted view of mstatus appears as the sstatus register in the
S-level ISA.


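Figure: Machine-mode status register (mstatus) for RV32. Fields, from bit 31 down, with widths in bits: SD (1), WPRI (8), TSR (1), TW (1), TVM (1), MXR (1), SUM (1), MPRV (1), XS[1:0] (2), FS[1:0] (2), MPP[1:0] (2), VS[1:0] (2), SPP (1), MPIE (1), UBE (1), SPIE (1), WPRI (1), MIE (1), WPRI (1), SIE (1), WPRI (1).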


For RV32 only, mstatush is a 32-bit read/write register formatted as
shown in Figure [mstatushreg]. Bits 30:4 of mstatush
generally contain the same fields found in bits 62:36 of mstatus for
RV64. Fields SD, SXL, and UXL do not exist in mstatush.


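Figure: Additional machine-mode status register (mstatush) for RV32. Fields, from bit 31 down, with widths in bits: WPRI (26), MBE (1), SBE (1), WPRI (4).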


Privilege and Global Interrupt-Enable Stack in mstatus register

Global interrupt-enable bits, MIE and SIE, are provided for M-mode and
S-mode respectively. These bits are primarily used to guarantee
atomicity with respect to interrupt handlers in the current privilege
mode.

The global xIE bits are located in the low-order bits of mstatus,
allowing them to be atomically set or cleared with a single CSR
instruction.

When a hart is executing in privilege mode x, interrupts are globally
enabled when xIE=1 and globally disabled when xIE=0. Interrupts for
lower-privilege modes, w<x, are always globally disabled regardless
of the setting of any global wIE bit for the lower-privilege mode.
Interrupts for higher-privilege modes, y>x, are always globally
enabled regardless of the setting of the global yIE bit for the
higher-privilege mode. Higher-privilege-level code can use separate
per-interrupt enable bits to disable selected higher-privilege-mode
interrupts before ceding control to a lower-privilege mode.
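
The three cases above reduce to a comparison of privilege levels. The sketch below uses the standard numeric mode encodings (U=0, S=1, M=3); the function name is an assumption, and per-interrupt enable bits are deliberately left out.

```python
U, S, M = 0, 1, 3  # privilege-mode encodings


def globally_enabled(current_mode, target_mode, target_xie):
    """Are interrupts for target_mode globally enabled while running in current_mode?"""
    if target_mode < current_mode:
        return False         # lower-privilege interrupts: always globally disabled
    if target_mode > current_mode:
        return True          # higher-privilege interrupts: always globally enabled
    return bool(target_xie)  # same mode: controlled by xIE


print(globally_enabled(S, M, 0))  # True: MIE is ignored while running in S-mode
```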

A higher-privilege mode y could disable all of its interrupts before
ceding control to a lower-privilege mode but this would be unusual as it
would leave only a synchronous trap, non-maskable interrupt, or reset as
means to regain control of the hart.

To support nested traps, each privilege mode x that can respond to
interrupts has a two-level stack of interrupt-enable bits and privilege
modes. xPIE holds the value of the interrupt-enable bit active prior
to the trap, and xPP holds the previous privilege mode. The xPP
fields can only hold privilege modes up to x, so MPP is two bits wide
and SPP is one bit wide. When a trap is taken from privilege mode y
into privilege mode x, xPIE is set to the value of xIE; xIE is
set to 0; and xPP is set to y.

For lower privilege modes, any trap (synchronous or asynchronous) is
usually taken at a higher privilege mode with interrupts disabled upon
entry. The higher-level trap handler will either service the trap and
return using the stacked information, or, if not returning immediately
to the interrupted context, will save the privilege stack before
re-enabling interrupts, so only one entry per stack is required.

An MRET or SRET instruction is used to return from a trap in M-mode or
S-mode respectively. When executing an xRET instruction, supposing
xPP holds the value y, xIE is set to xPIE; the privilege mode is
changed to y; xPIE is set to 1; and xPP is set to the
least-privileged supported mode (U if U-mode is implemented, else M). If
y≠M, xRET also sets MPRV=0.
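
The trap-entry and xRET updates for x = M can be traced with a toy model. The `Hart` class is a hypothetical illustration that tracks only the mode, MIE, MPIE, MPP, and MPRV; it models neither the S-mode stack nor any other trap side effects.

```python
U, S, M = 0, 1, 3  # privilege-mode encodings


class Hart:
    def __init__(self, has_u_mode=True):
        self.mode = M
        self.mie = 0
        self.mpie = 0
        self.mpp = M
        self.mprv = 0
        self.least_privileged = U if has_u_mode else M

    def trap_to_m(self):
        """Trap from the current mode y into M-mode."""
        self.mpie = self.mie   # xPIE <- xIE
        self.mie = 0           # xIE  <- 0
        self.mpp = self.mode   # xPP  <- y
        self.mode = M

    def mret(self):
        y = self.mpp
        self.mie = self.mpie              # xIE  <- xPIE
        self.mode = y                     # privilege mode <- y
        self.mpie = 1                     # xPIE <- 1
        self.mpp = self.least_privileged  # xPP  <- least-privileged supported mode
        if y != M:
            self.mprv = 0


hart = Hart()
hart.mode, hart.mie = U, 1
hart.trap_to_m()
print(hart.mode, hart.mie, hart.mpie, hart.mpp)  # 3 0 1 0
hart.mret()
print(hart.mode, hart.mie, hart.mpie, hart.mpp)  # 0 1 1 0
```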

Setting xPP to the least-privileged supported mode on an xRET helps
identify software bugs in the management of the two-level privilege-mode
stack.

xPP fields are WARL fields that can hold only privilege mode x and any
implemented privilege mode lower than x. If privilege mode x is not
implemented, then xPP must be read-only 0.

M-mode software can determine whether a privilege mode is implemented by
writing that mode to MPP then reading it back.
If the machine provides only U and M modes, then only a single hardware
storage bit is required to represent either 00 or 11 in MPP.

Base ISA Control in mstatus Register

For RV64 systems, the SXL and UXL fields are WARL fields that control the
value of XLEN for S-mode and U-mode, respectively. The encoding of these
fields is the same as the MXL field of misa, shown in
Table [misabase]. The effective XLEN in S-mode
and U-mode are termed SXLEN and UXLEN, respectively.
For RV32 systems, the SXL and UXL fields do not exist, and SXLEN=32 and
UXLEN=32.
For RV64 systems, if S-mode is not supported, then SXL is read-only
zero. Otherwise, it is a WARL field that encodes the current value of SXLEN.
In particular, an implementation may make SXL be a read-only field whose
value always ensures that SXLEN=MXLEN.
For RV64 systems, if U-mode is not supported, then UXL is read-only
zero. Otherwise, it is a WARL field that encodes the current value of UXLEN.
In particular, an implementation may make UXL be a read-only field whose
value always ensures that UXLEN=MXLEN or UXLEN=SXLEN.
Whenever XLEN in any mode is set to a value less than the widest
supported XLEN, all operations must ignore source operand register bits
above the configured XLEN, and must sign-extend results to fill the
entire widest supported XLEN in the destination register. Similarly,
pc bits above XLEN are ignored, and when the pc is written, it is
sign-extended to fill the widest supported XLEN.
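
The operand-truncation and result sign-extension rule can be sketched as follows. `write_result` is a hypothetical helper that models a destination register as a Python integer of `widest` bits.

```python
def write_result(value, xlen, widest):
    """Sign-extend an xlen-bit result to fill a widest-bit destination register."""
    low_mask = (1 << xlen) - 1
    value &= low_mask                  # bits above the configured XLEN are ignored
    if value >> (xlen - 1):            # sign bit of the XLEN-bit result is set
        value |= ((1 << widest) - 1) ^ low_mask
    return value


# A 32-bit result with its sign bit set, held in a 64-bit register.
print(hex(write_result(0x8000_0000, 32, 64)))  # 0xffffffff80000000
```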

We require that operations always fill the entire underlying hardware
registers with defined values to avoid implementation-defined behavior.
To reduce hardware complexity, the architecture imposes no checks that
lower-privilege modes have XLEN settings less than or equal to the
next-higher privilege mode. In practice, such settings would almost
always be a software bug, but machine operation is well-defined even in
this case.

If MXLEN is changed from 32 to a wider width, each of mstatus fields
SXL and UXL, if not restricted to a single value, gets the value
corresponding to the widest supported width not wider than the new
MXLEN.
Memory Privilege in mstatus Register

The MPRV (Modify PRiVilege) bit modifies the effective privilege mode,
i.e., the privilege level at which loads and stores execute. When
MPRV=0, loads and stores behave as normal, using the translation and
protection mechanisms of the current privilege mode. When MPRV=1, load
and store memory addresses are translated and protected, and endianness
is applied, as though the current privilege mode were set to MPP.
Instruction address-translation and protection are unaffected by the
setting of MPRV. MPRV is read-only 0 if U-mode is not supported.
An MRET or SRET instruction that changes the privilege mode to a mode
less privileged than M also sets MPRV=0.
The MXR (Make eXecutable Readable) bit modifies the privilege with which
loads access virtual memory. When MXR=0, only loads from pages marked
readable (R=1 in Figure [sv32pte]) will succeed. When MXR=1, loads
from pages marked either readable or executable (R=1 or X=1) will
succeed. MXR has no effect when page-based virtual memory is not in
effect. MXR is read-only 0 if S-mode is not supported.

The MPRV and MXR mechanisms were conceived to improve the efficiency of
M-mode routines that emulate missing hardware features, e.g., misaligned
loads and stores. MPRV obviates the need to perform address translation
in software. MXR allows instruction words to be loaded from pages marked
execute-only.
The current privilege mode and the privilege mode specified by MPP might
have different XLEN settings. When MPRV=1, load and store memory
addresses are treated as though the current XLEN were set to MPP’s XLEN,
following the rules in
Section 1.1.6.2.

The SUM (permit Supervisor User Memory access) bit modifies the
privilege with which S-mode loads and stores access virtual memory. When
SUM=0, S-mode memory accesses to pages that are accessible by U-mode
(U=1 in Figure [sv32pte]) will fault. When SUM=1, these
accesses are permitted. SUM has no effect when page-based virtual memory
is not in effect. Note that, while SUM is ordinarily ignored when not
executing in S-mode, it is in effect when MPRV=1 and MPP=S. SUM is
read-only 0 if S-mode is not supported or if satp.MODE is read-only 0.
The MXR and SUM mechanisms only affect the interpretation of permissions
encoded in page-table entries. In particular, they have no impact on
whether access-fault exceptions are raised due to PMAs or PMP.
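
The effect of MXR and SUM on page-table permission checks can be sketched with two predicates. The function names are assumptions, and the sketch covers only the R, X, and U bits discussed above. It ignores the W bit, other page-table checks, and PMA/PMP checks, which are applied separately.

```python
def load_allowed_by_pte(r, x, mxr):
    """A load passes the read-permission check if R=1, or if MXR=1 and X=1."""
    return bool(r or (mxr and x))


def s_mode_data_access_allowed(u, sum_bit):
    """An S-mode load or store to a U=1 page passes the U-bit check only when SUM=1."""
    return bool((not u) or sum_bit)


print(load_allowed_by_pte(r=0, x=1, mxr=1))            # True: execute-only page, MXR set
print(s_mode_data_access_allowed(u=1, sum_bit=0))      # False: user page, SUM clear
```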
Endianness Control in mstatus and mstatush Registers

The MBE, SBE, and UBE bits in mstatus and mstatush are WARL fields that
control the endianness of memory accesses other than instruction
fetches. Instruction fetches are always little-endian.
MBE controls whether non-instruction-fetch memory accesses made from
M-mode (assuming mstatus.MPRV=0) are little-endian (MBE=0) or
big-endian (MBE=1).
If S-mode is not supported, SBE is read-only 0. Otherwise, SBE controls
whether explicit load and store memory accesses made from S-mode are
little-endian (SBE=0) or big-endian (SBE=1).
If U-mode is not supported, UBE is read-only 0. Otherwise, UBE controls
whether explicit load and store memory accesses made from U-mode are
little-endian (UBE=0) or big-endian (UBE=1).
For implicit accesses to supervisor-level memory management data
structures, such as page tables, endianness is always controlled by SBE.
Since changing SBE alters the implementation’s interpretation of these
data structures, if any such data structures remain in use across a
change to SBE, M-mode software must follow such a change to SBE by
executing an SFENCE.VMA instruction with rs1=x0 and rs2=x0.

Only in contrived scenarios will a given memory-management data
structure be interpreted as both little-endian and big-endian. In
practice, SBE will only be changed at runtime on world switches, in
which case neither the old nor new memory-management data structure will
be reinterpreted in a different endianness. In this case, no additional
SFENCE.VMA is necessary, beyond what would ordinarily be required for a
world switch.

If S-mode is supported, an implementation may make SBE be a read-only
copy of MBE. If U-mode is supported, an implementation may make UBE be a
read-only copy of either MBE or SBE.

An implementation supports only little-endian memory accesses if fields
MBE, SBE, and UBE are all read-only 0. An implementation supports only
big-endian memory accesses (aside from instruction fetches) if MBE is
read-only 1 and SBE and UBE are each read-only 1 when S-mode and U-mode
are supported.


Volume I defines a hart’s address space as a circular sequence of
2^XLEN bytes at consecutive addresses. The correspondence
between addresses and byte locations is fixed and not affected by any
endianness mode. Rather, the applicable endianness mode determines the
order of mapping between memory bytes and a multibyte quantity
(halfword, word, etc.).


Standard RISC-V ABIs are expected to be purely little-endian-only or
big-endian-only, with no accommodation for mixing endianness.
Nevertheless, endianness control has been defined so as to permit, for
instance, an OS of one endianness to execute user-mode programs of the
opposite endianness. Consideration has been given also to the
possibility of non-standard usages whereby software flips the endianness
of memory accesses as needed.


RISC-V instructions are uniformly little-endian to decouple instruction
encoding from the current endianness settings, for the benefit of both
hardware and software. Otherwise, for instance, a RISC-V assembler or
disassembler would always need to know the intended active endianness,
even though the endianness mode might change dynamically during
execution. In contrast, by giving instructions a fixed endianness, it is
sometimes possible for carefully written software to be
endianness-agnostic even in binary form, much like position-independent
code.
The choice to have instructions be only little-endian does have
consequences, however, for RISC-V software that encodes or decodes
machine instructions. In big-endian mode, such software must account for
the fact that explicit loads and stores have endianness opposite that of
instructions, for example by swapping byte order after loads and before
stores.
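
The byte swap needed by software that reads instruction words in big-endian mode can be illustrated as follows. This is a sketch that models memory as a Python `bytes` object; `load_word` and `byteswap32` are hypothetical helpers, and the instruction used is the canonical NOP encoding (ADDI x0, x0, 0).

```python
def load_word(memory, addr, big_endian):
    """Model an explicit 32-bit load under the given data endianness."""
    return int.from_bytes(memory[addr:addr + 4], "big" if big_endian else "little")


def byteswap32(x):
    """Reverse the byte order of a 32-bit value."""
    return int.from_bytes(x.to_bytes(4, "little"), "big")


NOP = 0x00000013                    # ADDI x0, x0, 0
memory = NOP.to_bytes(4, "little")  # instructions are always stored little-endian

raw = load_word(memory, 0, big_endian=True)
print(hex(raw))              # 0x13000000
print(hex(byteswap32(raw)))  # 0x13
```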

Virtualization Support in mstatus Register

The TVM (Trap Virtual Memory) bit is a WARL field that supports intercepting
supervisor virtual-memory management operations. When TVM=1, attempts to
read or write the satp CSR or execute an SFENCE.VMA or SINVAL.VMA
instruction while executing in S-mode will raise an illegal instruction
exception. When TVM=0, these operations are permitted in S-mode. TVM is
read-only 0 when S-mode is not supported.

The TVM mechanism improves virtualization efficiency by permitting guest
operating systems to execute in S-mode, rather than classically
virtualizing them in U-mode. This approach obviates the need to trap
accesses to most S-mode CSRs.
Trapping satp accesses and the SFENCE.VMA and SINVAL.VMA instructions
provides the hooks necessary to lazily populate shadow page tables.

The TW (Timeout Wait) bit is a WARL field that supports intercepting the WFI
instruction (see
Section 1.3.3).
When TW=0, the WFI instruction may execute in lower privilege modes when
not prevented for some other reason. When TW=1, then if WFI is executed
in any less-privileged mode, and it does not complete within an
implementation-specific, bounded time limit, the WFI instruction causes
an illegal instruction exception. The time limit may always be 0, in
which case WFI always causes an illegal instruction exception in
less-privileged modes when TW=1. TW is read-only 0 when there are no
modes less privileged than M.

Trapping the WFI instruction can trigger a world switch to another guest
OS, rather than wastefully idling in the current guest.

When S-mode is implemented, then executing WFI in U-mode causes an
illegal instruction exception, unless it completes within an
implementation-specific, bounded time limit. A future revision of this
specification might add a feature that allows S-mode to selectively
permit WFI in U-mode. Such a feature would only be active when TW=0.
The TSR (Trap SRET) bit is a WARL field that supports intercepting the
supervisor exception return instruction, SRET. When TSR=1, attempts to
execute SRET while executing in S-mode will raise an illegal instruction
exception. When TSR=0, this operation is permitted in S-mode. TSR is
read-only 0 when S-mode is not supported.
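
The S-mode behavior of TVM, TW, and TSR can be summarized as a decision function. This is an illustrative sketch: the operation names are hypothetical labels, and `wfi_timed_out` stands in for the implementation-specific time limit (it is always true when that limit is 0).

```python
def s_mode_op_traps(op, tvm=0, tsr=0, tw=0, wfi_timed_out=True):
    """Does this operation, executed in S-mode, raise an illegal instruction exception?"""
    if op in ("csr_satp", "sfence.vma", "sinval.vma"):
        return bool(tvm)
    if op == "sret":
        return bool(tsr)
    if op == "wfi":
        return bool(tw and wfi_timed_out)
    return False


print(s_mode_op_traps("sfence.vma", tvm=1))  # True
print(s_mode_op_traps("sret", tsr=0))        # False
```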

Trapping SRET is necessary to emulate the hypervisor extension (see
Chapter [hypervisor]) on implementations that do
not provide it.

Extension Context Status in mstatus Register

Supporting substantial extensions is one of the primary goals of RISC-V,
and hence we define a standard interface to allow unchanged
privileged-mode code, particularly a supervisor-level OS, to support
arbitrary user-mode state extensions.

To date, the V extension is the only standard extension that defines
additional state beyond the floating-point CSR and data registers.

The FS[1:0] and VS[1:0] WARL fields and the XS[1:0] read-only field
are used to reduce the cost of context save and restore by setting and
tracking the current state of the floating-point unit and any other
user-mode extensions respectively. The FS field encodes the status of
the floating-point unit state, including the floating-point registers
f0–f31 and the CSRs fcsr, frm, and fflags. The VS field
encodes the status of the vector extension state, including the vector
registers v0–v31 and the CSRs vcsr, vxrm, vxsat, vstart,
vl, vtype, and vlenb. The XS field encodes the status of
additional user-mode extensions and associated state. These fields can
be checked by a context switch routine to quickly determine whether a
state save or restore is required. If a save or restore is required,
additional instructions and CSRs are typically required to effect and
optimize the process.

The design anticipates that most context switches will not need to
save/restore state in either or both of the floating-point unit or other
extensions, so provides a fast check via the SD bit.

The FS, VS, and XS fields use the same status encoding as shown in
Table [fsxsencoding], with the four possible
status values being Off, Initial, Clean, and Dirty.


| Status | FS and VS Meaning | XS Meaning |
|:--|:--|:--|
| 0 | Off | All off |
| 1 | Initial | None dirty or clean, some on |
| 2 | Clean | None dirty, some clean |
| 3 | Dirty | Some dirty |


If the F extension is implemented, the FS field shall not be read-only
zero.
If neither the F extension nor S-mode is implemented, then FS is
read-only zero. If S-mode is implemented but the F extension is not, FS
may optionally be read-only zero.

Implementations with S-mode but without the F extension are permitted,
but not required, to make the FS field be read-only zero. Some such
implementations will choose not to have the FS field be read-only
zero, so as to enable emulation of the F extension for both S-mode and
U-mode via invisible traps into M-mode.

If the v registers are implemented, the VS field shall not be
read-only zero.
If neither the v registers nor S-mode is implemented, then VS is
read-only zero. If S-mode is implemented but the v registers are not,
VS may optionally be read-only zero.

In systems without additional user extensions requiring new state, the
XS field is read-only zero. Every additional extension with state
provides a CSR field that encodes the equivalent of the XS states. The
XS field represents a summary of all extensions’ status as shown in
Table [fsxsencoding].

The XS field effectively reports the maximum status value across all
user-extension status fields, though individual extensions can use a
different encoding than XS.

The SD bit is a read-only bit that summarizes whether either the FS, VS,
or XS fields signal the presence of some dirty state that will require
saving extended user context to memory. If FS, XS, and VS are all
read-only zero, then SD is also always zero.

When an extension’s status is set to Off, any instruction that attempts
to read or write the corresponding state will cause an illegal
instruction exception. When the status is Initial, the corresponding
state should have an initial constant value. When the status is Clean,
the corresponding state is potentially different from the initial value,
but matches the last value stored on a context swap. When the status is
Dirty, the corresponding state has potentially been modified since the
last context save.

During a context save, the responsible privileged code need only write
out the corresponding state if its status is Dirty, and can then reset
the extension’s status to Clean. During a context restore, the context
need only be loaded from memory if the status is Clean (it should never
be Dirty at restore). If the status is Initial, the context must be set
to an initial constant value on context restore to avoid a security
hole, but this can be done without accessing memory. For example, the
floating-point registers can all be initialized to the immediate value
0.
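As a non-normative illustration, the save and restore rules above can be sketched in Python. The names `Status`, `context_save`, and `context_restore`, the dictionary used as backing memory, and the 32-register default are all assumptions made for this sketch and are not part of the specification.

```python
from enum import IntEnum

class Status(IntEnum):
    OFF = 0
    INITIAL = 1
    CLEAN = 2
    DIRTY = 3

def context_save(status, live_state, memory):
    """Write state to memory only when Dirty; return the next status."""
    if status == Status.DIRTY:
        memory["saved"] = list(live_state)
        return Status.CLEAN
    return status  # Off, Initial, and Clean need no write

def context_restore(status, memory, initial_value=0, nregs=32):
    """Return the register contents to install for the resumed context."""
    if status == Status.OFF:
        return None                     # unit is off; nothing to restore
    if status == Status.INITIAL:
        return [initial_value] * nregs  # constant fill, no memory access
    if status == Status.CLEAN:
        return list(memory["saved"])    # reload the last saved copy
    raise ValueError("status should never be Dirty at restore")
```

The sketch only models the decision of whether memory traffic is needed; a real handler would use the extension's own load and store instructions.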

The FS and XS fields are read by the privileged code before saving the
context. The FS field is set directly by privileged code when resuming a
user context, while the XS field is set indirectly by writing to the
status register of the individual extensions. The status fields will
also be updated during execution of instructions, regardless of
privilege mode.

Extensions to the user-mode ISA often include additional user-mode
state, and this state can be considerably larger than the base integer
registers. The extensions might only be used for some applications, or
might only be needed for short phases within a single application. To
improve performance, the user-mode extension can define additional
instructions to allow user-mode software to return the unit to an
initial state or even to turn off the unit.

For example, a coprocessor might need to be configured before use and
can be “unconfigured” after use. The unconfigured state would be
represented as the Initial state for context save. If the same
application remains running between the unconfigure and the next
configure (which would set status to Dirty), there is no need to
actually reinitialize the state at the unconfigure instruction, as all
state is local to the user process, i.e., the Initial state may only
cause the coprocessor state to be initialized to a constant value at
context restore, not at every unconfigure.

Executing a user-mode instruction to disable a unit and place it into
the Off state will cause an illegal instruction exception to be raised
if any subsequent instruction tries to use the unit before it is turned
back on. A user-mode instruction to turn a unit on must also ensure the
unit’s state is properly initialized, as the unit might have been used
by another context in the meantime.

Changing the setting of FS has no effect on the contents of the
floating-point register state. In particular, setting FS=Off does not
destroy the state, nor does setting FS=Initial clear the contents.
Similarly, the setting of VS has no effect on the contents of the vector
register state. Other extensions, however, might not preserve state when
set to Off.

Implementations may choose to track the dirtiness of the floating-point
register state imprecisely by reporting the state to be dirty even when
it has not been modified. On some implementations, some instructions
that do not mutate the floating-point state may cause the state to
transition from Initial or Clean to Dirty. On other implementations,
dirtiness might not be tracked at all, in which case the valid FS states
are Off and Dirty, and an attempt to set FS to Initial or Clean causes
it to be set to Dirty.

This definition of FS does not disallow setting FS to Dirty as a result
of errant speculation. Some platforms may choose to disallow
speculatively writing FS to close a potential side channel.

If an instruction explicitly or implicitly writes a floating-point
register or the fcsr but does not alter its contents, and FS=Initial
or FS=Clean, it is implementation-defined whether FS transitions to
Dirty.

Implementations may choose to track the dirtiness of the vector register
state in an analogous imprecise fashion, including possibly setting VS
to Dirty when software attempts to set VS=Initial or VS=Clean. When
VS=Initial or VS=Clean, it is implementation-defined whether an
instruction that writes a vector register or vector CSR but does not
alter its contents causes VS to transition to Dirty.

Table [fsxsstates] shows all the possible
state transitions for the FS, VS, or XS status bits. Note that the
standard floating-point and vector extensions do not support user-mode
unconfigure or disable/enable instructions.


| Current State | Off | Initial | Clean | Dirty |
|:--|:--|:--|:--|:--|
| Action | | | | |
| At context save in privileged code | | | | |
| Save state? | No | No | No | Yes |
| Next state | Off | Initial | Clean | Clean |
| At context restore in privileged code | | | | |
| Restore state? | No | Yes, to initial | Yes, from memory | N/A |
| Next state | Off | Initial | Clean | N/A |
| Execute instruction to read state | | | | |
| Action? | Exception | Execute | Execute | Execute |
| Next state | Off | Initial | Clean | Dirty |
| Execute instruction that possibly modifies state, including configuration | | | | |
| Action? | Exception | Execute | Execute | Execute |
| Next state | Off | Dirty | Dirty | Dirty |
| Execute instruction to unconfigure unit | | | | |
| Action? | Exception | Execute | Execute | Execute |
| Next state | Off | Initial | Initial | Initial |
| Execute instruction to disable unit | | | | |
| Action? | Execute | Execute | Execute | Execute |
| Next state | Off | Off | Off | Off |
| Execute instruction to enable unit | | | | |
| Action? | Execute | Execute | Execute | Execute |
| Next state | Initial | Initial | Initial | Initial |


Standard privileged instructions to initialize, save, and restore
extension state are provided to insulate privileged code from details of
the added extension state by treating the state as an opaque object.

Many coprocessor extensions are only used in limited contexts that
allow software to safely unconfigure or even disable units when done.
This reduces the context-switch overhead of large stateful coprocessors.
We separate out floating-point state from other extension state, as when
a floating-point unit is present the floating-point registers are part
of the standard calling convention, and so user-mode software cannot
know when it is safe to disable the floating-point unit.

The XS field provides a summary of all added extension state, but
additional microarchitectural bits might be maintained in the extension
to further reduce context save and restore overhead.

The SD bit is read-only and is set when either the FS, VS, or XS bits
encode a Dirty state (i.e., SD=((FS==11) OR (XS==11) OR (VS==11))). This
allows privileged code to quickly determine when no additional context
save is required beyond the integer register set and pc.
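The SD summary can be sketched as a small Python function. The field positions used here (FS in bits 14:13, VS in bits 10:9, XS in bits 16:15) are taken from the mstatus layout defined elsewhere in the specification, not from this passage, and the helper names are hypothetical.

```python
# Assumed mstatus field positions: low bit of each 2-bit status field.
FS_SHIFT, VS_SHIFT, XS_SHIFT = 13, 9, 15
DIRTY = 0b11

def field(mstatus, shift):
    """Extract a 2-bit status field from an mstatus value."""
    return (mstatus >> shift) & 0b11

def sd_bit(mstatus):
    """SD = (FS == 11) OR (XS == 11) OR (VS == 11)."""
    return int(any(field(mstatus, s) == DIRTY
                   for s in (FS_SHIFT, VS_SHIFT, XS_SHIFT)))
```

A context-switch routine that sees `sd_bit(...) == 0` can skip the extended save and write out only the integer registers and pc.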

The floating-point unit state is always initialized, saved, and restored
using standard instructions (F, D, and/or Q), and privileged code must
be aware of FLEN to determine the appropriate space to reserve for each
f register.

Machine and Supervisor modes share a single copy of the FS, VS, and XS
bits. Supervisor-level software normally uses the FS, VS, and XS bits
directly to record the status with respect to the supervisor-level saved
context. Machine-level software must be more conservative in saving and
restoring the extension state in its corresponding version of the
context.

In any reasonable use case, the number of context switches between user
and supervisor level should far outweigh the number of context switches
to other privilege levels. Note that coprocessors should not require
their context to be saved and restored to service asynchronous
interrupts, unless the interrupt results in a user-level context swap.

Machine Trap-Vector Base-Address Register (mtvec)

The mtvec register is an MXLEN-bit WARL read/write register that holds
trap vector configuration, consisting of a vector base address (BASE)
and a vector mode (MODE).


[Figure: mtvec register layout. BASE[MXLEN-1:2] occupies the upper MXLEN-2 bits and MODE occupies the low 2 bits.]


The mtvec register must always be implemented, but can contain a
read-only value. If mtvec is writable, the set of values the register
may hold can vary by implementation. The value in the BASE field must
always be aligned on a 4-byte boundary, and the MODE setting may impose
additional alignment constraints on the value in the BASE field.

We allow for considerable flexibility in implementation of the trap
vector base address. On the one hand, we do not wish to burden low-end
implementations with a large number of state bits, but on the other
hand, we wish to allow flexibility for larger systems.


| Value | Name | Description |
|:--|:--|:--|
| 0 | Direct | All exceptions set pc to BASE. |
| 1 | Vectored | Asynchronous interrupts set pc to BASE+4×cause. |
| ≥2 | - | Reserved |


The encoding of the MODE field is shown in
Table [mtvec-mode]. When MODE=Direct, all
traps into machine mode cause the pc to be set to the address in the
BASE field. When MODE=Vectored, all synchronous exceptions into machine
mode cause the pc to be set to the address in the BASE field, whereas
interrupts cause the pc to be set to the address in the BASE field
plus four times the interrupt cause number. For example, a machine-mode
timer interrupt (see Table [mcauses]) causes the pc to be
set to BASE+0x1c.
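The address computation can be illustrated with a short Python sketch. It assumes MODE sits in the two low bits of mtvec and that the trap address is the register value with those bits cleared; the function name `trap_pc` is hypothetical.

```python
MODE_DIRECT, MODE_VECTORED = 0, 1

def trap_pc(mtvec, cause, is_interrupt):
    """Return the pc selected on a trap into M-mode."""
    base = mtvec & ~0b11   # BASE, always 4-byte aligned
    mode = mtvec & 0b11    # MODE in the two low bits (assumed layout)
    if mode == MODE_DIRECT:
        return base
    if mode == MODE_VECTORED:
        # Only interrupts are vectored; synchronous exceptions go to BASE.
        return base + 4 * cause if is_interrupt else base
    raise ValueError("reserved MODE encoding")
```

With a hypothetical BASE of 0x80000000 in Vectored mode, a machine timer interrupt (cause 7) yields 0x8000001c, matching the BASE+0x1c example.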

When vectored interrupts are enabled, interrupt cause 0, which
corresponds to user-mode software interrupts, is vectored to the same
location as synchronous exceptions. This ambiguity does not arise in
practice, since user-mode software interrupts are either disabled or
delegated to user mode.

An implementation may have different alignment constraints for different
modes. In particular, MODE=Vectored may have stricter alignment
constraints than MODE=Direct.

Allowing coarser alignments in Vectored mode enables vectoring to be
implemented without a hardware adder circuit.


Reset and NMI vector locations are given in a platform specification.

Machine Trap Delegation Registers (medeleg and mideleg)

By default, all traps at any privilege level are handled in machine
mode, though a machine-mode handler can redirect traps back to the
appropriate level with the MRET instruction
(Section 1.3.2). To increase performance,
implementations can provide individual read/write bits within medeleg
and mideleg to indicate that certain exceptions and interrupts should
be processed directly by a lower privilege level. The machine exception
delegation register (medeleg) and machine interrupt delegation
register (mideleg) are MXLEN-bit read/write registers.

In systems with S-mode, the medeleg and mideleg registers must
exist, and setting a bit in medeleg or mideleg will delegate the
corresponding trap, when occurring in S-mode or U-mode, to the S-mode
trap handler. In systems without S-mode, the medeleg and mideleg
registers should not exist.

In versions 1.9.1 and earlier, these registers existed but were
hardwired to zero in M-mode only, or M/U without N systems. There is no
reason to require they return zero in those cases, as the misa
register indicates whether they exist.

When a trap is delegated to S-mode, the scause register is written
with the trap cause; the sepc register is written with the virtual
address of the instruction that took the trap; the stval register is
written with an exception-specific datum; the SPP field of mstatus is
written with the active privilege mode at the time of the trap; the SPIE
field of mstatus is written with the value of the SIE field at the
time of the trap; and the SIE field of mstatus is cleared. The
mcause, mepc, and mtval registers and the MPP and MPIE fields of
mstatus are not written.

An implementation can choose to subset the delegatable traps, with the
supported delegatable bits found by writing one to every bit location,
then reading back the value in medeleg or mideleg to see which bit
positions hold a one.
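The write-ones-then-read-back probe can be sketched against a toy register model. `WarlCsr` and `probe_delegatable` are hypothetical names for this illustration; on real hardware the reads and writes would be CSR instructions executed in M-mode.

```python
class WarlCsr:
    """Toy model of a CSR in which only some bits are writable."""

    def __init__(self, writable_mask):
        self.writable_mask = writable_mask
        self.value = 0

    def write(self, v):
        self.value = v & self.writable_mask  # unsupported bits stay zero

    def read(self):
        return self.value

def probe_delegatable(csr, xlen=64):
    """Return a mask of supported delegation bits, preserving the old value."""
    saved = csr.read()
    csr.write((1 << xlen) - 1)  # write one to every bit location
    supported = csr.read()      # bits that stuck are delegatable
    csr.write(saved)
    return supported
```

Restoring the saved value is a design choice of the sketch, so that probing does not change which traps are currently delegated.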

An implementation shall not have any bits of medeleg be read-only one,
i.e., any synchronous trap that can be delegated must support not being
delegated. Similarly, an implementation shall not fix as read-only one
any bits of mideleg corresponding to machine-level interrupts (but may
do so for lower-level interrupts).

Version 1.11 and earlier prohibited having any bits of mideleg be
read-only one. Platform standards may always add such restrictions.

Traps never transition from a more-privileged mode to a less-privileged
mode. For example, if M-mode has delegated illegal instruction
exceptions to S-mode, and M-mode software later executes an illegal
instruction, the trap is taken in M-mode, rather than being delegated to
S-mode. By contrast, traps may be taken horizontally. Using the same
example, if M-mode has delegated illegal instruction exceptions to
S-mode, and S-mode software later executes an illegal instruction, the
trap is taken in S-mode.
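The rule that traps never move to a less-privileged mode can be sketched for synchronous exceptions as follows. The privilege encodings (U=0, S=1, M=3) come from elsewhere in the specification, and `exception_target` is a hypothetical helper that ignores any further delegation below S-mode.

```python
M, S, U = 3, 1, 0  # privilege mode encodings (assumed from the spec)

def exception_target(current_mode, cause, medeleg, has_s_mode=True):
    """Return the mode that handles a synchronous exception."""
    delegated = has_s_mode and bool((medeleg >> cause) & 1)
    # A delegated trap is taken in S-mode only when it occurs in S or U.
    if delegated and current_mode <= S:
        return S
    return M
```

For example, with illegal instruction exceptions (cause 2) delegated, an illegal instruction executed in M-mode still traps to M-mode, while the same instruction in S-mode traps to S-mode.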

Delegated interrupts result in the interrupt being masked at the
delegator privilege level. For example, if the supervisor timer
interrupt (STI) is delegated to S-mode by setting mideleg[5], STIs
will not be taken when executing in M-mode. By contrast, if
mideleg[5] is clear, STIs can be taken in any mode and regardless of
current mode will transfer control to M-mode.


[Figure: machine exception delegation register medeleg, MXLEN bits wide.]


medeleg has a bit position allocated for every synchronous exception
shown in Table [mcauses], with the index of the
bit position equal to the value returned in the mcause register (i.e.,
setting bit 8 allows user-mode environment calls to be delegated to a
lower-privilege trap handler).


[Figure: machine interrupt delegation register mideleg, MXLEN bits wide.]


mideleg holds trap delegation bits for individual interrupts, with the
layout of bits matching those in the mip register (i.e., STIP
interrupt delegation control is located in bit 5).

For exceptions that cannot occur in less privileged modes, the
corresponding medeleg bits should be read-only zero. In particular,
medeleg[11] is read-only zero.

Machine Interrupt Registers (mip and mie)

The mip register is an MXLEN-bit read/write register containing
information on pending interrupts, while mie is the corresponding
MXLEN-bit read/write register containing interrupt enable bits.
Interrupt cause number i (as reported in CSR mcause,
Section 1.1.15) corresponds with bit i in both
mip and mie. Bits 15:0 are allocated to standard interrupt causes
only, while bits 16 and above are designated for platform or custom use.


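[Figure: machine interrupt-pending register mip, MXLEN bits wide.]

[Figure: machine interrupt-enable register mie, MXLEN bits wide.]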


An interrupt i will trap to M-mode (causing the privilege mode to
change to M-mode) if all of the following are true: (a) either the
current privilege mode is M and the MIE bit in the mstatus register is
set, or the current privilege mode has less privilege than M-mode;
(b) bit i is set in both mip and mie; and (c) if register
mideleg exists, bit i is not set in mideleg.
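Conditions (a) through (c) translate directly into a predicate. The following Python sketch uses hypothetical parameter names and passes `mideleg=None` to represent an implementation in which the register does not exist.

```python
def traps_to_m(i, cur_mode, mstatus_mie, mip, mie, mideleg=None):
    """Return True if interrupt i traps to M-mode under conditions (a)-(c)."""
    M = 3
    # (a) M-mode with MIE set, or any less-privileged mode.
    a = (cur_mode == M and mstatus_mie) or cur_mode < M
    # (b) bit i is set in both mip and mie.
    b = bool((mip >> i) & 1 and (mie >> i) & 1)
    # (c) if mideleg exists, bit i is not set in it.
    c = mideleg is None or not (mideleg >> i) & 1
    return bool(a and b and c)
```

Note that the global MIE bit is consulted only while running in M-mode; in S-mode or U-mode an enabled, pending, non-delegated interrupt is always taken.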

These conditions for an interrupt trap to occur must be evaluated in a
bounded amount of time from when an interrupt becomes, or ceases to be,
pending in mip, and must also be evaluated immediately following the
execution of an xRET instruction or an explicit write to a CSR on
which these interrupt trap conditions expressly depend (including mip,
mie, mstatus, and mideleg).

Interrupts to M-mode take priority over any interrupts to lower
privilege modes.

Each individual bit in register mip may be writable or may be
read-only. When bit i in mip is writable, a pending interrupt i
can be cleared by writing 0 to this bit. If interrupt i can become
pending but bit i in mip is read-only, the implementation must
provide some other mechanism for clearing the pending interrupt.

A bit in mie must be writable if the corresponding interrupt can ever
become pending. Bits of mie that are not writable must be read-only
zero.

The standard portions (bits 15:0) of registers mip and mie are
formatted as shown in Figures
[mipreg-standard] and
[miereg-standard] respectively.


[Figure: standard portion (bits 15:0) of mip. Bit 11 is MEIP, bit 9 is SEIP, bit 7 is MTIP, bit 5 is STIP, bit 3 is MSIP, and bit 1 is SSIP; the remaining bits are 0.]

[Figure: standard portion (bits 15:0) of mie. Bit 11 is MEIE, bit 9 is SEIE, bit 7 is MTIE, bit 5 is STIE, bit 3 is MSIE, and bit 1 is SSIE; the remaining bits are 0.]


The machine-level interrupt registers handle a few root interrupt
sources which are assigned a fixed service priority for simplicity,
while separate external interrupt controllers can implement a more
complex prioritization scheme over a much larger set of interrupts that
are then muxed into the machine-level interrupt sources.


The non-maskable interrupt is not made visible via the mip register as
its presence is implicitly known when executing the NMI trap handler.

Bits mip.MEIP and mie.MEIE are the interrupt-pending and
interrupt-enable bits for machine-level external interrupts. MEIP is
read-only in mip, and is set and cleared by a platform-specific
interrupt controller.

Bits mip.MTIP and mie.MTIE are the interrupt-pending and
interrupt-enable bits for machine timer interrupts. MTIP is read-only in
mip, and is cleared by writing to the memory-mapped machine-mode timer
compare register.

Bits mip.MSIP and mie.MSIE are the interrupt-pending and
interrupt-enable bits for machine-level software interrupts. MSIP is
read-only in mip, and is written by accesses to memory-mapped control
registers, which are used by remote harts to provide machine-level
interprocessor interrupts. A hart can write its own MSIP bit using the
same memory-mapped control register. If a system has only one hart, or
if a platform standard supports the delivery of machine-level
interprocessor interrupts through external interrupts (MEI) instead,
then mip.MSIP and mie.MSIE may both be read-only zeros.

If supervisor mode is not implemented, bits SEIP, STIP, and SSIP of
mip and SEIE, STIE, and SSIE of mie are read-only zeros.

If supervisor mode is implemented, bits mip.SEIP and mie.SEIE are
the interrupt-pending and interrupt-enable bits for supervisor-level
external interrupts. SEIP is writable in mip, and may be written by
M-mode software to indicate to S-mode that an external interrupt is
pending. Additionally, the platform-level interrupt controller may
generate supervisor-level external interrupts. Supervisor-level external
interrupts are made pending based on the logical-OR of the
software-writable SEIP bit and the signal from the external interrupt
controller. When mip is read with a CSR instruction, the value of the
SEIP bit returned in the rd destination register is the logical-OR of
the software-writable bit and the interrupt signal from the interrupt
controller, but the signal from the interrupt controller is not used to
calculate the value written to SEIP. Only the software-writable SEIP bit
participates in the read-modify-write sequence of a CSRRS or CSRRC
instruction.

For example, if we name the software-writable SEIP bit B and the
signal from the external interrupt controller E, then if
csrrs t0, mip, t1 is executed, t0[9] is written with B || E, then
B is written with B || t1[9]. If csrrw t0, mip, t1 is executed,
then t0[9] is written with B || E, and B is simply written with
t1[9]. In neither case does B depend upon E.
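The B and E example can be modeled in a few lines of Python. The class `MipSeip` is a hypothetical model of bit 9 of mip only; its methods mirror the two instructions in the example and return the value that would be written to the destination register.

```python
SEIP = 1 << 9

class MipSeip:
    """Toy model of the SEIP bit of mip."""

    def __init__(self):
        self.b = 0  # software-writable SEIP bit (B)
        self.e = 0  # signal from the external interrupt controller (E)

    def _read(self):
        # Reads observe the logical-OR of B and E.
        return SEIP if (self.b or self.e) else 0

    def csrrs(self, t1):
        old = self._read()
        self.b = self.b | ((t1 >> 9) & 1)  # only B takes part in the update
        return old

    def csrrw(self, t1):
        old = self._read()
        self.b = (t1 >> 9) & 1
        return old
```

Setting `e = 1` and executing `csrrs(0)` returns a value with SEIP set yet leaves `b` at 0, which shows that the controller signal never leaks into the software-writable bit.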

The SEIP field behavior is designed to allow a higher privilege layer to
mimic external interrupts cleanly, without losing any real external
interrupts. The behavior of the CSR instructions is slightly modified
from regular CSR accesses as a result.

If supervisor mode is implemented, bits mip.STIP and mie.STIE are
the interrupt-pending and interrupt-enable bits for supervisor-level
timer interrupts. STIP is writable in mip, and may be written by
M-mode software to deliver timer interrupts to S-mode.

If supervisor mode is implemented, bits mip.SSIP and mie.SSIE are
the interrupt-pending and interrupt-enable bits for supervisor-level
software interrupts. SSIP is writable in mip and may also be set to 1
by a platform-specific interrupt controller.

Multiple simultaneous interrupts destined for M-mode are handled in the
following decreasing priority order: MEI, MSI, MTI, SEI, SSI, STI.
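A minimal Python sketch of this ordering for the standard causes only. Platform-specific interrupts in bits 16 and above have platform-defined priority and are deliberately left out; the name `highest_priority` is hypothetical.

```python
# Cause numbers in decreasing priority: MEI, MSI, MTI, SEI, SSI, STI.
PRIORITY = [11, 3, 7, 9, 1, 5]

def highest_priority(pending_enabled):
    """Pick the standard interrupt to take from a mask of mip & mie bits."""
    for cause in PRIORITY:
        if (pending_enabled >> cause) & 1:
            return cause
    return None  # no standard interrupt is pending and enabled
```

The argument is assumed to already contain only interrupts destined for M-mode, that is, pending, enabled, and not delegated.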

The machine-level interrupt fixed-priority ordering rules were developed
with the following rationale.

Interrupts for higher privilege modes must be serviced before interrupts
for lower privilege modes to support preemption.

The platform-specific machine-level interrupt sources in bits 16 and
above have platform-specific priority, but are typically chosen to have
the highest service priority to support very fast local vectored
interrupts.

External interrupts are handled before internal (timer/software)
interrupts as external interrupts are usually generated by devices that
might require low interrupt service times.

Software interrupts are handled before internal timer interrupts,
because internal timer interrupts are usually intended for time slicing,
where time precision is less important, whereas software interrupts are
used for inter-processor messaging. Software interrupts can be avoided
when high-precision timing is required, or high-precision timer
interrupts can be routed via a different interrupt path. Software
interrupts are located in the lowest four bits of mip as these are
often written by software, and this position allows the use of a single
CSR instruction with a five-bit immediate.

Restricted views of the mip and mie registers appear as the sip
and sie registers for supervisor level. If an interrupt is delegated
to S-mode by setting a bit in the mideleg register, it becomes visible
in the sip register and is maskable using the sie register.
Otherwise, the corresponding bits in sip and sie are read-only zero.

Hardware Performance Monitor

M-mode includes a basic hardware performance-monitoring facility. The
mcycle CSR counts the number of clock cycles executed by the processor
core on which the hart is running. The minstret CSR counts the number
of instructions the hart has retired. The mcycle and minstret
registers have 64-bit precision on all RV32 and RV64 systems.

The counter registers have an arbitrary value after the hart is reset,
and can be written with a given value. Any CSR write takes effect after
the writing instruction has otherwise completed. The mcycle CSR may be
shared between harts on the same core, in which case writes to mcycle
will be visible to those harts. The platform should provide a mechanism
to indicate which harts share an mcycle CSR.

The hardware performance monitor includes 29 additional 64-bit event
counters, mhpmcounter3–mhpmcounter31. The event selector CSRs,
mhpmevent3–mhpmevent31, are MXLEN-bit WARL registers that control which
event causes the corresponding counter to increment. The meaning of
these events is defined by the platform, but event 0 is defined to mean
“no event.” All counters should be implemented, but a legal
implementation is to make both the counter and its corresponding event
selector be read-only 0.


[Figure: hardware performance monitor counters. mcycle, minstret, and mhpmcounter3–mhpmcounter31 are 64 bits each; mhpmevent3–mhpmevent31 are MXLEN bits each.]


The mhpmcounters are WARL registers that support up to 64 bits of
precision on RV32 and RV64.

A future revision of this specification will define a mechanism to
generate an interrupt when a hardware performance monitor counter
overflows.

When MXLEN=32, reads of the mcycle, minstret, and mhpmcountern
CSRs return bits 31–0 of the corresponding counter, and writes change
only bits 31–0; reads of the mcycleh, minstreth, and mhpmcounternh
CSRs return bits 63–32 of the corresponding counter, and writes change
only bits 63–32.
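The split access can be modeled with a toy 64-bit counter. `Counter64` and its method names are hypothetical; each method stands in for a CSR read or write of the low register (for example mcycle) or the high register (for example mcycleh).

```python
MASK32 = 0xFFFFFFFF

class Counter64:
    """Toy model of one 64-bit counter as accessed when MXLEN=32."""

    def __init__(self, value=0):
        self.value = value & ((1 << 64) - 1)

    def read_lo(self):      # e.g. a read of mcycle
        return self.value & MASK32

    def read_hi(self):      # e.g. a read of mcycleh
        return self.value >> 32

    def write_lo(self, v):  # changes only bits 31-0
        self.value = (self.value & (MASK32 << 32)) | (v & MASK32)

    def write_hi(self, v):  # changes only bits 63-32
        self.value = ((v & MASK32) << 32) | (self.value & MASK32)
```

Because each access touches only one half, software that needs a consistent 64-bit sample on RV32 has to guard against the low half wrapping between the two reads.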


[Figure: mcycleh, minstreth, and mhpmcounternh registers (upper 32 bits of the counters), RV32 only.]


Machine Counter-Enable Register (mcounteren)

The counter-enable register mcounteren is a 32-bit register that
controls the availability of the hardware performance-monitoring
counters to the next-lowest privileged mode.


[Figure: counter-enable register mcounteren. Bits 31:3 are HPM31–HPM3, bit 2 is IR, bit 1 is TM, and bit 0 is CY.]


The settings in this register only control accessibility. The act of
reading or writing this register does not affect the underlying
counters, which continue to increment even when not accessible.

When the CY, TM, IR, or HPMn bit in the mcounteren register is
clear, attempts to read the cycle, time, instret, or
hpmcountern register while executing in S-mode or U-mode will cause an
illegal instruction exception. When one of these bits is set, access to
the corresponding register is permitted in the next implemented
privilege mode (S-mode if implemented, otherwise U-mode).
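The M-level check can be sketched as follows. The bit positions (CY in bit 0, TM in bit 1, IR in bit 2, HPMn in bit n) are assumptions taken from the register layout, and the sketch models only the mcounteren check; any additional gating applied by less-privileged counter-enable registers is out of scope.

```python
M_MODE = 3
FIXED_BITS = {"cycle": 0, "time": 1, "instret": 2}  # CY, TM, IR (assumed)

def counter_bit(name):
    """Map a counter CSR name to its assumed mcounteren bit position."""
    if name in FIXED_BITS:
        return FIXED_BITS[name]
    prefix = "hpmcounter"
    if name.startswith(prefix):
        n = int(name[len(prefix):])
        if 3 <= n <= 31:
            return n
    raise ValueError("not a counter CSR: " + name)

def m_permits_read(name, mode, mcounteren):
    """False models an illegal instruction exception on the read."""
    if mode == M_MODE:
        return True  # mcounteren does not restrict M-mode itself
    return bool((mcounteren >> counter_bit(name)) & 1)
```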

The counter-enable bits support two common use cases with minimal
hardware. For systems that do not need high-performance timers and
counters, machine-mode software can trap accesses and implement all
features in software. For systems that need high-performance timers and
counters but are not concerned with obfuscating the underlying hardware
counters, the counters can be directly exposed to lower privilege modes.

The cycle, instret, and hpmcountern CSRs are read-only shadows of
mcycle, minstret, and mhpmcountern, respectively. The time CSR
is a read-only shadow of the memory-mapped mtime register.
Analogously, on RV32I the cycleh, instreth and hpmcounternh CSRs
are read-only shadows of mcycleh, minstreth and mhpmcounternh,
respectively. On RV32I the timeh CSR is a read-only shadow of the
upper 32 bits of the memory-mapped mtime register, while time
shadows only the lower 32 bits of mtime.

Implementations can convert reads of the time and timeh CSRs into
loads to the memory-mapped mtime register, or emulate this
functionality in M-mode software.

In systems with U-mode, the mcounteren register must be implemented, but all
fields are WARL and may be read-only zero, indicating reads to the
corresponding counter will cause an illegal instruction exception when
executing in a less-privileged mode. In systems without U-mode, the
mcounteren register should not exist.

Machine Counter-Inhibit CSR (mcountinhibit)


[Figure: counter-inhibit register mcountinhibit. Bits 31:3 are HPM31–HPM3, bit 2 is IR, bit 1 is read-only zero, and bit 0 is CY.]


The counter-inhibit register mcountinhibit is a 32-bit WARL register that
controls which of the hardware performance-monitoring counters
increment. The settings in this register only control whether the
counters increment; their accessibility is not affected by the setting
of this register.

When the CY, IR, or HPMn bit in the mcountinhibit register is clear,
the cycle, instret, or hpmcountern register increments as usual.
When the CY, IR, or HPMn bit is set, the corresponding counter does
not increment.

The mcycle CSR may be shared between harts on the same core, in which
case the mcountinhibit.CY field is also shared between those harts,
and so writes to mcountinhibit.CY will be visible to those harts.

If the mcountinhibit register is not implemented, the implementation
behaves as though the register were set to zero.

When the cycle and instret counters are not needed, it is desirable
to conditionally inhibit them to reduce energy consumption. Providing a
single CSR to inhibit all counters also allows the counters to be
atomically sampled.

Because the time counter can be shared between multiple cores, it
cannot be inhibited with the mcountinhibit mechanism.

Machine Scratch Register (mscratch)

The mscratch register is an MXLEN-bit read/write register dedicated
for use by machine mode. Typically, it is used to hold a pointer to a
machine-mode hart-local context space and swapped with a user register
upon entry to an M-mode trap handler.


[Figure: machine-mode scratch register mscratch, MXLEN bits wide.]


The MIPS ISA allocated two user registers (k0/k1) for use by the
operating system. Although the MIPS scheme provides a fast and simple
implementation, it also reduces available user registers, and does not
scale to further privilege levels, or nested traps. It can also require
that both registers be cleared before returning to user level to avoid a
potential security hole and to provide deterministic debugging behavior.
The RISC-V user ISA was designed to support many possible privileged
system environments and so we did not want to infect the user-level ISA
with any OS-dependent features. The RISC-V CSR swap instructions can
quickly save/restore values to the mscratch register. Unlike the MIPS
design, the OS can rely on holding a value in the mscratch register
while the user context is running.

Machine Exception Program Counter (mepc)

mepc is an MXLEN-bit read/write register formatted as shown in
Figure [mepcreg]. The low bit of mepc
(mepc[0]) is always zero. On implementations that support only
IALIGN=32, the two low bits (mepc[1:0]) are always zero.

If an implementation allows IALIGN to be either 16 or 32 (by changing
CSR misa, for example), then, whenever IALIGN=32, bit mepc[1] is
masked on reads so that it appears to be 0. This masking occurs also for
the implicit read by the MRET instruction. Though masked, mepc[1]
remains writable when IALIGN=32.
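The read masking can be illustrated with a small Python sketch for an implementation that supports both IALIGN settings. `write_mepc` and `read_mepc` are hypothetical helpers that model only the two low bits' behavior.

```python
def write_mepc(value):
    """Return the value actually stored: mepc[0] is always zero."""
    return value & ~0b1

def read_mepc(stored, ialign):
    """Return the value seen by a CSR read or by the implicit MRET read."""
    if ialign == 32:
        return stored & ~0b10  # mepc[1] reads as 0 but stays stored
    return stored
```

Because bit 1 is masked rather than cleared, a value written while IALIGN=32 becomes fully visible again if IALIGN is later switched to 16.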

mepc is a WARL register that must be able to hold all valid virtual
addresses. It need not be capable of holding all possible invalid
addresses. Prior to writing mepc, implementations may convert an
invalid address into some other invalid address that mepc is capable
of holding.
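
The read-masking rule for mepc can be modelled in a few lines. The following is a minimal sketch, not a description of any particular implementation; the function name `mepc_read` and its parameters are hypothetical, and it assumes the stored register keeps bit 1 writable as described above.

```python
def mepc_read(stored: int, ialign: int, mxlen: int = 64) -> int:
    """Value observed on a read of mepc, including the implicit read by MRET.

    `stored` is the raw register content; `ialign` is 16 or 32.
    """
    if ialign not in (16, 32):
        raise ValueError("IALIGN must be 16 or 32")
    value = stored & ((1 << mxlen) - 1)
    value &= ~0b1            # mepc[0] is always zero
    if ialign == 32:
        value &= ~0b10       # mepc[1] is masked on reads while IALIGN=32
    return value


# Bit 1 stays in `stored`, so it becomes visible again if IALIGN returns to 16.
print(hex(mepc_read(0x80000002, 32)), hex(mepc_read(0x80000002, 16)))
```
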

When address translation is not in effect, virtual addresses and
physical addresses are equal. Hence, the set of addresses mepc must be
able to represent includes the set of physical addresses that can be
used as a valid pc or effective address.

When a trap is taken into M-mode, mepc is written with the virtual
address of the instruction that was interrupted or that encountered the
exception. Otherwise, mepc is never written by the implementation,
though it may be explicitly written by software.


Figure [mepcreg]: Machine exception program counter register (mepc), a single MXLEN-bit field.


Machine Cause Register (mcause)

The mcause register is an MXLEN-bit read-write register formatted as
shown in Figure [mcausereg]. When a trap is taken into
M-mode, mcause is written with a code indicating the event that
caused the trap. Otherwise, mcause is never written by the
implementation, though it may be explicitly written by software.
The Interrupt bit in the mcause register is set if the trap was caused
by an interrupt. The Exception Code field contains a code identifying
the last exception or interrupt.
Table [mcauses] lists the possible machine-level
exception codes. The Exception Code is a WLRL field, so is only guaranteed
to hold supported exception codes.


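Figure [mcausereg]: Machine Cause register (mcause). The Interrupt bit occupies bit MXLEN-1; the Exception Code field occupies the remaining MXLEN-1 bits.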


| Interrupt | Exception Code | Description |
|:--|:--|:--|
| 1 | 0 | Reserved |
| 1 | 1 | Supervisor software interrupt |
| 1 | 2 | Reserved |
| 1 | 3 | Machine software interrupt |
| 1 | 4 | Reserved |
| 1 | 5 | Supervisor timer interrupt |
| 1 | 6 | Reserved |
| 1 | 7 | Machine timer interrupt |
| 1 | 8 | Reserved |
| 1 | 9 | Supervisor external interrupt |
| 1 | 10 | Reserved |
| 1 | 11 | Machine external interrupt |
| 1 | 12–15 | Reserved |
| 1 | ≥16 | Designated for platform use |
| 0 | 0 | Instruction address misaligned |
| 0 | 1 | Instruction access fault |
| 0 | 2 | Illegal instruction |
| 0 | 3 | Breakpoint |
| 0 | 4 | Load address misaligned |
| 0 | 5 | Load access fault |
| 0 | 6 | Store/AMO address misaligned |
| 0 | 7 | Store/AMO access fault |
| 0 | 8 | Environment call from U-mode |
| 0 | 9 | Environment call from S-mode |
| 0 | 10 | Reserved |
| 0 | 11 | Environment call from M-mode |
| 0 | 12 | Instruction page fault |
| 0 | 13 | Load page fault |
| 0 | 14 | Reserved |
| 0 | 15 | Store/AMO page fault |
| 0 | 16–23 | Reserved |
| 0 | 24–31 | Designated for custom use |
| 0 | 32–47 | Reserved |
| 0 | 48–63 | Designated for custom use |
| 0 | ≥64 | Reserved |


Note that load and load-reserved instructions generate load exceptions,
whereas store, store-conditional, and AMO instructions generate
store/AMO exceptions.

Interrupts can be separated from other traps with a single branch on the
sign of the mcause register value. A shift left can remove the
interrupt bit and scale the exception codes to index into a trap vector
table.
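
As an illustration of the two operations just described, here is a small sketch that models mcause as an unsigned MXLEN-bit integer. The helper names `decode_mcause` and `table_offset`, and the choice of 4-byte table entries (a shift of 2), are assumptions made for the example only.

```python
def decode_mcause(mcause: int, mxlen: int = 64):
    """Split mcause into (is_interrupt, exception_code).

    The Interrupt bit is the most-significant bit, so on hardware this is
    a test of the sign of the register value.
    """
    interrupt_bit = 1 << (mxlen - 1)
    return bool(mcause & interrupt_bit), mcause & (interrupt_bit - 1)


def table_offset(mcause: int, mxlen: int = 64, shift: int = 2) -> int:
    """Shift left within MXLEN bits: drops the Interrupt bit and scales the code.

    shift=2 assumes hypothetical 4-byte table entries.
    """
    return (mcause << shift) & ((1 << mxlen) - 1)


machine_timer = (1 << 63) | 7        # Interrupt=1, Exception Code=7
print(decode_mcause(machine_timer), table_offset(machine_timer))
```
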


We do not distinguish privileged instruction exceptions from illegal
opcode exceptions. This simplifies the architecture and also hides
details of which higher-privilege instructions are supported by an
implementation. The privilege level servicing the trap can implement a
policy on whether these need to be distinguished, and if so, whether a
given opcode should be treated as illegal or privileged.

If an instruction may raise multiple synchronous exceptions, the
decreasing priority order of
Table [exception-priority] indicates
which exception is taken and reported in mcause. The priority of any
custom synchronous exceptions is implementation-defined.


| Priority | Exc. Code | Description |
|:--|:--|:--|
| Highest | 3 | Instruction address breakpoint |
| | | During instruction address translation: |
| | 12, 1 | First encountered page fault or access fault |
| | | With physical address for instruction: |
| | 1 | Instruction access fault |
| | 2 | Illegal instruction |
| | 0 | Instruction address misaligned |
| | 8, 9, 11 | Environment call |
| | 3 | Environment break |
| | 3 | Load/store/AMO address breakpoint |
| | | Optionally: |
| | 4, 6 | Load/store/AMO address misaligned |
| | | During address translation for an explicit memory access: |
| | 13, 15, 5, 7 | First encountered page fault or access fault |
| | | With physical address for an explicit memory access: |
| | 5, 7 | Load/store/AMO access fault |
| | | If not higher priority: |
| Lowest | 4, 6 | Load/store/AMO address misaligned |


When a virtual address is translated into a physical address, the
address translation algorithm determines what specific exception may be
raised.
Load/store/AMO address-misaligned exceptions may have either higher or
lower priority than load/store/AMO page-fault and access-fault
exceptions.

The relative priority of load/store/AMO address-misaligned and
page-fault exceptions is implementation-defined to flexibly cater to two
design points. Implementations that never support misaligned accesses
can unconditionally raise the misaligned-address exception without
performing address translation or protection checks. Implementations
that support misaligned accesses only to some physical addresses must
translate and check the address before determining whether the
misaligned access may proceed, in which case raising the page-fault or
access-fault exception is more appropriate.


Instruction address breakpoints have the same cause value as, but
different priority than, data address breakpoints (a.k.a. watchpoints)
and environment break exceptions (which are raised by the EBREAK
instruction).


Instruction address misaligned exceptions are raised by control-flow
instructions with misaligned targets, rather than by the act of fetching
an instruction. Therefore, these exceptions have lower priority than
other instruction address exceptions.

Machine Trap Value Register (mtval)

The mtval register is an MXLEN-bit read-write register formatted as
shown in Figure [mtvalreg]. When a trap is taken into
M-mode, mtval is either set to zero or written with exception-specific
information to assist software in handling the trap. Otherwise, mtval
is never written by the implementation, though it may be explicitly
written by software. The hardware platform will specify which exceptions
must set mtval informatively and which may unconditionally set it to
zero. If the hardware platform specifies that no exceptions set mtval
to a nonzero value, then mtval is read-only zero.
If mtval is written with a nonzero value when a breakpoint,
address-misaligned, access-fault, or page-fault exception occurs on an
instruction fetch, load, or store, then mtval will contain the
faulting virtual address.

When page-based virtual memory is enabled, mtval is written with the
faulting virtual address, even for physical-memory access-fault
exceptions. This design reduces datapath cost for most implementations,
particularly those with hardware page-table walkers.


Figure [mtvalreg]: Machine Trap Value register (mtval), a single MXLEN-bit field.


If mtval is written with a nonzero value when a misaligned load or
store causes an access-fault or page-fault exception, then mtval will
contain the virtual address of the portion of the access that caused the
fault.
If mtval is written with a nonzero value when an instruction
access-fault or page-fault exception occurs on a system with
variable-length instructions, then mtval will contain the virtual
address of the portion of the instruction that caused the fault, while
mepc will point to the beginning of the instruction.
The mtval register can optionally also be used to return the faulting
instruction bits on an illegal instruction exception (mepc points to
the faulting instruction in memory). If mtval is written with a
nonzero value when an illegal-instruction exception occurs, then mtval
will contain the shortest of:

the actual faulting instruction
the first ILEN bits of the faulting instruction
the first MXLEN bits of the faulting instruction

The value loaded into mtval on an illegal-instruction exception is
right-justified and all unused upper bits are cleared to zero.
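
The "shortest of" rule above amounts to truncating the instruction to the smallest of three widths. The sketch below is one way to express it; the function name and the convention that the first-fetched bits are the low-order bits of `insn_bits` are assumptions of this example.

```python
def mtval_for_illegal_instruction(insn_bits: int, insn_len: int,
                                  ilen: int, mxlen: int) -> int:
    """Value written to mtval when it reports the faulting instruction bits.

    insn_bits: the instruction as an integer, first-fetched bits in the
               low-order positions (an assumption of this sketch).
    insn_len:  actual instruction length in bits.
    """
    width = min(insn_len, ilen, mxlen)
    # Right-justified; all unused upper bits are zero.
    return insn_bits & ((1 << width) - 1)


# A hypothetical 48-bit encoding on a machine with ILEN=32 and MXLEN=32:
print(hex(mtval_for_illegal_instruction(0x123456789ABC, 48, 32, 32)))
```
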

Capturing the faulting instruction in mtval reduces the overhead of
instruction emulation, potentially avoiding several partial instruction
loads if the instruction is misaligned, and likely data cache misses or
slow uncached accesses when loads are used to fetch the instruction into
a data register. There is also a problem of atomicity if another agent
is manipulating the instruction memory, as might occur in a dynamic
translation system.
A requirement is that the entire instruction (or at least the first
MXLEN bits) are fetched into mtval before taking the trap. This should
not constrain implementations, which would typically fetch the entire
instruction before attempting to decode the instruction, and avoids
complicating software handlers.
A value of zero in mtval signifies either that the feature is not
supported, or an illegal zero instruction was fetched. A load from the
instruction memory pointed to by mepc can be used to distinguish these
two cases (or alternatively, the system configuration information can be
interrogated to install the appropriate trap handling before runtime).

For other traps, mtval is set to zero, but a future standard may
redefine mtval’s setting for other traps.
If mtval is not read-only zero, it is a WARL register that must be able to
hold all valid virtual addresses and the value zero. It need not be
capable of holding all possible invalid addresses. Prior to writing
mtval, implementations may convert an invalid address into some other
invalid address that mtval is capable of holding. If the feature to
return the faulting instruction bits is implemented, mtval must also
be able to hold all values less than 2^N, where N is the
smaller of MXLEN and ILEN.

Machine Configuration Pointer Register (mconfigptr)

mconfigptr is an MXLEN-bit read-only CSR, formatted as shown in
Figure [mconfigptrreg], that holds the
physical address of a configuration data structure. Software can
traverse this data structure to discover information about the harts,
the platform, and their configuration.


Figure [mconfigptrreg]: Machine Configuration Pointer register (mconfigptr), a single MXLEN-bit field.


The pointer alignment in bits must be no smaller than the greatest
supported MXLEN: i.e., if the greatest supported MXLEN is 8 × n, then
mconfigptr[log₂n-1:0] must be zero.
mconfigptr must be implemented, but it may be zero to indicate the
configuration data structure does not exist or that an alternative
mechanism must be used to locate it.
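
The alignment rule can be checked mechanically. In this sketch, `mconfigptr_is_aligned` is a hypothetical helper: for a greatest supported MXLEN of 8 × n it tests that the low log₂n bits of the pointer are zero.

```python
def mconfigptr_is_aligned(mconfigptr: int, max_mxlen: int) -> bool:
    """Check that mconfigptr[log2(n)-1:0] is zero, where max_mxlen = 8 * n."""
    n = max_mxlen // 8                  # required alignment in bytes
    low_bits = n.bit_length() - 1       # log2(n); n is a power of two
    return (mconfigptr & ((1 << low_bits) - 1)) == 0


# With MXLEN=64 the pointer must be 8-byte aligned; with MXLEN=32, 4-byte.
print(mconfigptr_is_aligned(0x1004, 64), mconfigptr_is_aligned(0x1004, 32))
```

A value of zero passes the check trivially, which is consistent with zero meaning that no configuration structure is advertised.
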

The format and schema of the configuration data structure have yet to be
standardized.


While mconfigptr will simply be hardwired in some implementations,
other implementations may provide a means to configure the value
returned on CSR reads. For example, mconfigptr might present the value
of a memory-mapped register that is programmed by the platform or by
M-mode software towards the beginning of the boot process.

Machine Environment Configuration Registers (menvcfg and menvcfgh)

The menvcfg CSR is an MXLEN-bit read/write register, formatted for
MXLEN=64 as shown in
Figure [fig:menvcfg], that controls certain
characteristics of the execution environment for modes less privileged
than M.


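Figure [fig:menvcfg]: Machine environment configuration register (menvcfg) for MXLEN=64. Fields from most to least significant: STCE (1 bit), PBMTE (1), WPRI (54), CBZE (1), CBCFE (1), CBIE (2), WPRI (3), FIOM (1).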


If bit FIOM (Fence of I/O implies Memory) is set to one in menvcfg,
FENCE instructions executed in modes less privileged than M are modified
so the requirement to order accesses to device I/O implies also the
requirement to order main memory accesses.
Table 1.1 details the modified
interpretation of FENCE instruction bits PI, PO, SI, and SO for modes
less privileged than M when FIOM=1.
Similarly, for modes less privileged than M when FIOM=1, if an atomic
instruction that accesses a region ordered as device I/O has its aq
and/or rl bit set, then that instruction is ordered as though it
accesses both device I/O and memory.
If S-mode is not supported, or if satp.MODE is read-only zero (always
Bare), the implementation may make FIOM read-only zero.


| Instruction bit | Meaning when set |
|:--|:--|
| PI | Predecessor device input and memory reads (PR implied) |
| PO | Predecessor device output and memory writes (PW implied) |
| SI | Successor device input and memory reads (SR implied) |
| SO | Successor device output and memory writes (SW implied) |


Modified interpretation of FENCE predecessor and successor sets for
modes less privileged than M when FIOM=1.
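
The modified interpretation can be sketched as a transformation on a 4-bit predecessor or successor set. The bit order I, O, R, W (most to least significant) mirrors the FENCE pred and succ fields; the function name `effective_fence_set` is hypothetical.

```python
I_BIT, O_BIT, R_BIT, W_BIT = 0b1000, 0b0100, 0b0010, 0b0001


def effective_fence_set(bits: int, fiom: bool) -> int:
    """Return the I/O/R/W set as interpreted in a mode less privileged than M."""
    if fiom:
        if bits & I_BIT:     # device input implies memory reads
            bits |= R_BIT
        if bits & O_BIT:     # device output implies memory writes
            bits |= W_BIT
    return bits


# "fence i, o" behaves like "fence ir, ow" when FIOM=1.
print(bin(effective_fence_set(I_BIT, True)), bin(effective_fence_set(O_BIT, True)))
```
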


Bit FIOM is needed in menvcfg so M-mode can emulate the hypervisor
extension of Chapter [hypervisor], which has an equivalent
FIOM bit in the hypervisor CSR henvcfg.

The PBMTE bit controls whether the Svpbmt extension is available for use
in S-mode and G-stage address translation (i.e., for page tables pointed
to by satp or hgatp). When PBMTE=1, Svpbmt is available for S-mode
and G-stage address translation. When PBMTE=0, the implementation
behaves as though Svpbmt were not implemented. If Svpbmt is not
implemented, PBMTE is read-only zero. Furthermore, for implementations
with the hypervisor extension, henvcfg.PBMTE is read-only zero if
menvcfg.PBMTE is zero.
The definition of the STCE field will be furnished by the forthcoming
Sstc extension. Its allocation within menvcfg may change prior to the
ratification of that extension.
The definition of the CBZE field will be furnished by the forthcoming
Zicboz extension. Its allocation within menvcfg may change prior to
the ratification of that extension.
The definitions of the CBCFE and CBIE fields will be furnished by the
forthcoming Zicbom extension. Their allocations within menvcfg may
change prior to the ratification of that extension.
When MXLEN=32, menvcfg contains the same fields as bits 31:0 of
menvcfg when MXLEN=64. Additionally, when MXLEN=32, menvcfgh is a
32-bit read/write register that contains the same fields as bits 63:32
of menvcfg when MXLEN=64. Register menvcfgh does not exist when
MXLEN=64.
If U-mode is not supported, then registers menvcfg and menvcfgh do
not exist.
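
The relationship between the MXLEN=64 view and the menvcfg/menvcfgh pair at MXLEN=32 is a plain split into halves. The sketch below assumes FIOM at bit 0 and PBMTE at bit 62 of the 64-bit layout; these positions and the helper name are assumptions of the example, and the allocations noted above as subject to change are not modelled.

```python
FIOM = 1 << 0       # assumed position in the 64-bit layout
PBMTE = 1 << 62     # assumed position in the 64-bit layout


def split_menvcfg(value64: int):
    """Return (menvcfg, menvcfgh) as they would appear when MXLEN=32."""
    return value64 & 0xFFFFFFFF, (value64 >> 32) & 0xFFFFFFFF


low, high = split_menvcfg(PBMTE | FIOM)
# PBMTE lands at bit 30 of menvcfgh; FIOM stays at bit 0 of menvcfg.
print(hex(low), hex(high))
```
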
Machine Security Configuration Register (mseccfg)

mseccfg is an optional MXLEN-bit read/write register, formatted as
shown in Figure [fig:mseccfg], that controls security
features.
When MXLEN=32 only, mseccfgh is a 32-bit read/write register that
contains the same fields as mseccfg bits 63:32 when MXLEN=64.


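Figure [fig:mseccfg]: Machine security configuration register (mseccfg). Fields from most to least significant: WPRI (MXLEN-10 bits), SSEED (1), USEED (1), WPRI (5), RLB (1), MMWP (1), MML (1).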


The definitions of the SSEED and USEED fields will be furnished by the
forthcoming entropy-source extension, Zkr. Their allocations within
mseccfg may change prior to the ratification of that extension.
The definitions of the RLB, MMWP, and MML fields will be furnished by
the forthcoming PMP-enhancement extension, Smepmp. Their allocations
within mseccfg may change prior to the ratification of that extension.

Machine-Level Memory-Mapped Registers

Machine Timer Registers (mtime and mtimecmp)

Platforms provide a real-time counter, exposed as a memory-mapped
machine-mode read-write register, mtime. mtime must increment at
constant frequency, and the platform must provide a mechanism for
determining the period of an mtime tick. The mtime register will
wrap around if the count overflows.
The mtime register has 64-bit precision on all RV32 and RV64
systems. Platforms provide a 64-bit memory-mapped machine-mode timer
compare register (mtimecmp). A machine timer interrupt becomes pending
whenever mtime contains a value greater than or equal to mtimecmp,
treating the values as unsigned integers. The interrupt remains posted
until mtimecmp becomes greater than mtime (typically as a result of
writing mtimecmp). The interrupt will only be taken if interrupts are
enabled and the MTIE bit is set in the mie register.
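
The pending and taken conditions reduce to an unsigned comparison plus two enable checks. This is a minimal model under the stated rules; the function and parameter names are hypothetical, and `interrupts_enabled` stands in for whatever global enable applies at the current privilege level.

```python
MASK64 = (1 << 64) - 1


def timer_interrupt_pending(mtime: int, mtimecmp: int) -> bool:
    """MTIP condition: mtime >= mtimecmp, both treated as unsigned 64-bit."""
    return (mtime & MASK64) >= (mtimecmp & MASK64)


def timer_interrupt_taken(mtime: int, mtimecmp: int,
                          mtie: bool, interrupts_enabled: bool) -> bool:
    """The interrupt is taken only if it is pending, MTIE is set in mie,
    and interrupts are enabled."""
    return timer_interrupt_pending(mtime, mtimecmp) and mtie and interrupts_enabled


# Writing the all-ones value to mtimecmp effectively disarms the timer.
print(timer_interrupt_pending(12345, MASK64))
```
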


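Figure: Machine time register (mtime), a 64-bit memory-mapped register.

Figure: Machine time compare register (mtimecmp), a 64-bit memory-mapped register.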


The timer facility is defined to use wall-clock time rather than a cycle
counter to support modern processors that run with a highly variable
clock frequency to save energy through dynamic voltage and frequency
scaling.
Accurate real-time clocks (RTCs) are relatively expensive to provide
(requiring a crystal or MEMS oscillator) and have to run even when the
rest of the system is powered down, and so there is usually only one in a
system located in a different frequency/voltage domain from the
processors. Hence, the RTC must be shared by all the harts in a system
and accesses to the RTC will potentially incur the penalty of a
voltage-level-shifter and clock-domain crossing. It is thus more natural
to expose mtime as a memory-mapped register than as a CSR.
Lower privilege levels do not have their own timecmp registers.
Instead, machine-mode software can implement any number of virtual
timers on a hart by multiplexing the next timer interrupt into the
mtimecmp register.
Simple fixed-frequency systems can use a single clock for both cycle
counting and wall-clock time.

Writes to mtime and mtimecmp are guaranteed to be reflected in MTIP
eventually, but not necessarily immediately.

A spurious timer interrupt might occur if an interrupt handler
increments mtimecmp then immediately returns, because MTIP might not
yet have fallen in the interim. All software should be written to assume
this event is possible, but most software should assume this event is
extremely unlikely. It is almost always more performant to incur an
occasional spurious timer interrupt than to poll MTIP until it falls.

In RV32, memory-mapped writes to mtimecmp modify only one 32-bit part
of the register. The following code sequence sets a 64-bit mtimecmp
value without spuriously generating a timer interrupt due to the
intermediate value of the comparand:

            # New comparand is in a1:a0.
            li t0, -1
            la t1, mtimecmp
            sw t0, 0(t1)     # No smaller than old value.
            sw a1, 4(t1)     # No smaller than new value.
            sw a0, 0(t1)     # New value.


For RV64, naturally aligned 64-bit memory accesses to the mtime and
 mtimecmp registers are additionally supported and are atomic.
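
To see why the three-store RV32 sequence above avoids a spurious interrupt, the sketch below replays it on a 64-bit comparand modelled as two 32-bit halves. The function name is hypothetical; each entry of the returned list is the comparand value after one store.

```python
def rv32_set_mtimecmp(old: int, new: int):
    """Return the 64-bit comparand after each of the three 32-bit stores."""
    lo, hi = old & 0xFFFFFFFF, old >> 32
    steps = []
    lo = 0xFFFFFFFF                    # sw t0, 0(t1): low half = all ones
    steps.append((hi << 32) | lo)      # no smaller than the old value
    hi = new >> 32                     # sw a1, 4(t1): new high half
    steps.append((hi << 32) | lo)      # no smaller than the new value
    lo = new & 0xFFFFFFFF              # sw a0, 0(t1): new low half
    steps.append((hi << 32) | lo)      # the new value
    return steps


old, new = 0x0000000100000010, 0x0000000200000005
print([hex(v) for v in rv32_set_mtimecmp(old, new)])
```

Because no intermediate comparand is smaller than both the old and the new value, the comparison against mtime cannot fire earlier than it would have with one of those two values.
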
Machine-Mode Privileged Instructions

Environment Call and Breakpoint


| funct12 | rs1 | funct3 | rd | opcode |
|:--|:--|:--|:--|:--|
| 12 bits | 5 bits | 3 bits | 5 bits | 7 bits |
| ECALL | 0 | PRIV | 0 | SYSTEM |
| EBREAK | 0 | PRIV | 0 | SYSTEM |


The ECALL instruction is used to make a request to the supporting
execution environment. When executed in U-mode, S-mode, or M-mode, it
generates an environment-call-from-U-mode exception,
environment-call-from-S-mode exception, or environment-call-from-M-mode
exception, respectively, and performs no other operation.

ECALL generates a different exception for each originating privilege
mode so that environment call exceptions can be selectively delegated. A
typical use case for Unix-like operating systems is to delegate to
S-mode the environment-call-from-U-mode exception but not the others.
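
The per-mode cause codes and the delegation use case can be sketched as follows. The helper `ecall_trap` is hypothetical; it assumes S-mode is implemented and that medeleg is a bit mask indexed by exception code, and it records the rule from this section that the epc register receives the address of the ECALL itself.

```python
ECALL_CAUSE = {"U": 8, "S": 9, "M": 11}


def ecall_trap(mode: str, pc: int, medeleg: int = 0) -> dict:
    """Describe the trap raised by an ECALL at `pc` executed in `mode`."""
    code = ECALL_CAUSE[mode]
    # A trap is never delegated to a mode less privileged than its origin,
    # so an ECALL from M-mode is always handled in M-mode.
    delegated = mode != "M" and bool((medeleg >> code) & 1)
    return {"cause": code, "epc": pc, "handled_in": "S" if delegated else "M"}


# Unix-like setup: delegate only environment-call-from-U-mode to S-mode.
unix_medeleg = 1 << 8
print(ecall_trap("U", 0x1000, unix_medeleg), ecall_trap("S", 0x2000, unix_medeleg))
```
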

The EBREAK instruction is used by debuggers to cause control to be
transferred back to a debugging environment. It generates a breakpoint
exception and performs no other operation.

As described in the “C” Standard Extension for Compressed Instructions
in Volume I of this manual, the C.EBREAK instruction performs the same
operation as the EBREAK instruction.

ECALL and EBREAK cause the receiving privilege mode’s epc register to
be set to the address of the ECALL or EBREAK instruction itself, not
the address of the following instruction. As ECALL and EBREAK cause
synchronous exceptions, they are not considered to retire, and should
not increment the minstret CSR.

Trap-Return Instructions

Instructions to return from trap are encoded under the PRIV minor
opcode.


| funct12 | rs1 | funct3 | rd | opcode |
|:--|:--|:--|:--|:--|
| 12 bits | 5 bits | 3 bits | 5 bits | 7 bits |
| MRET/SRET | 0 | PRIV | 0 | SYSTEM |


To return after handling a trap, there are separate trap return
instructions per privilege level, MRET and SRET. MRET is always
provided. SRET must be provided if supervisor mode is supported, and
should raise an illegal instruction exception otherwise. SRET should
also raise an illegal instruction exception when TSR=1 in mstatus, as
described in Section 1.1.6.5. An xRET instruction can be
executed in privilege mode x or higher, where executing a
lower-privilege xRET instruction will pop the relevant lower-privilege
interrupt enable and privilege mode stack. In addition to manipulating
the privilege stack as described in
Section 1.1.6.1, xRET sets the pc to the
value stored in the xepc register.
If the A extension is supported, the xRET instruction is allowed to
clear any outstanding LR address reservation but is not required to.
Trap handlers should explicitly clear the reservation if required (e.g.,
by using a dummy SC) before executing the xRET.

If xRET instructions always cleared LR reservations, it would be
impossible to single-step through LR/SC sequences using a debugger.

Wait for Interrupt

The Wait for Interrupt instruction (WFI) provides a hint to the
implementation that the current hart can be stalled until an interrupt
might need servicing. Execution of the WFI instruction can also be used
to inform the hardware platform that suitable interrupts should
preferentially be routed to this hart. WFI is available in all
privileged modes, and optionally available to U-mode. This instruction
may raise an illegal instruction exception when TW=1 in mstatus, as
described in Section 1.1.6.5.


| funct12 | rs1 | funct3 | rd | opcode |
|:--|:--|:--|:--|:--|
| 12 bits | 5 bits | 3 bits | 5 bits | 7 bits |
| WFI | 0 | PRIV | 0 | SYSTEM |


If an enabled interrupt is present or later becomes present while the
hart is stalled, the interrupt trap will be taken on the following
instruction, i.e., execution resumes in the trap handler and mepc =
pc + 4.

The following instruction takes the interrupt trap so that a simple
return from the trap handler will execute code after the WFI
instruction.

The purpose of the WFI instruction is to provide a hint to the
implementation, and so a legal implementation is to simply implement WFI
as a NOP.

If the implementation does not stall the hart on execution of the
instruction, then the interrupt will be taken on some instruction in the
idle loop containing the WFI, and on a simple return from the handler,
the idle loop will resume execution.

The WFI instruction can also be executed when interrupts are disabled.
The operation of WFI must be unaffected by the global interrupt bits in
mstatus (MIE and SIE) and the delegation register mideleg (i.e.,
the hart must resume if a locally enabled interrupt becomes pending,
even if it has been delegated to a less-privileged mode), but should
honor the individual interrupt enables (e.g., MTIE) (i.e.,
implementations should avoid resuming the hart if the interrupt is
pending but not individually enabled). WFI is also required to resume
execution for locally enabled interrupts pending at any privilege level,
regardless of the global interrupt enable at each privilege level.
If the event that causes the hart to resume execution does not cause an
interrupt to be taken, execution will resume at pc + 4, and software
must determine what action to take, including looping back to repeat the
WFI if there was no actionable event.
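
The wakeup rule can be summarised as a test on the pending and individually enabled bits only. In this sketch, `wfi_required_to_resume` is a hypothetical name; it models only when resumption is required, since an implementation may also resume for other reasons (including treating WFI as a NOP). MTIP and MTIE are taken to be bit 7 of mip and mie.

```python
MTIP = MTIE = 1 << 7     # machine timer interrupt pending / enable


def wfi_required_to_resume(mip: int, mie: int) -> bool:
    """True if some interrupt is both pending and individually enabled.

    The global enables in mstatus (MIE, SIE) and the mideleg settings are
    deliberately not inputs: they do not affect the operation of WFI.
    """
    return (mip & mie) != 0


# Pending but not individually enabled: the hart should stay stalled.
print(wfi_required_to_resume(MTIP, 0), wfi_required_to_resume(MTIP, MTIE))
```
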

By allowing wakeup when interrupts are disabled, an alternate entry
point to an interrupt handler can be called that does not require saving
the current context, as the current context can be saved or discarded
before the WFI is executed.
As implementations are free to implement WFI as a NOP, software must
explicitly check for any relevant pending but disabled interrupts in the
code following a WFI, and should loop back to the WFI if no suitable
interrupt was detected. The mip or sip registers can be interrogated
to determine the presence of any interrupt in machine or supervisor mode
respectively.
The operation of WFI is unaffected by the delegation register settings.
WFI is defined so that an implementation can trap into a higher
privilege mode, either immediately on encountering the WFI or after some
interval to initiate a machine-mode transition to a lower power state,
for example.


The same “wait-for-event” template might be used for possible future
extensions that wait on memory locations changing, or message arrival.

Custom SYSTEM Instructions

The subspace of the SYSTEM major opcode shown in
Figure [fig:customsys] is designated for
custom use. It is recommended that these instructions use bits 29:28 to
designate the minimum required privilege mode, as do other SYSTEM
instructions.


| funct6 | custom | funct3 | custom | opcode | Recommended Purpose |
|:--|:--|:--|:--|:--|:--|
| 6 bits | 11 bits | 3 bits | 5 bits | 7 bits | |
| 100011 | custom | 0 | custom | SYSTEM | Unprivileged or User-Level |
| 110011 | custom | 0 | custom | SYSTEM | Unprivileged or User-Level |
| 100111 | custom | 0 | custom | SYSTEM | Supervisor-Level |
| 110111 | custom | 0 | custom | SYSTEM | Supervisor-Level |
| 101011 | custom | 0 | custom | SYSTEM | Hypervisor-Level |
| 111011 | custom | 0 | custom | SYSTEM | Hypervisor-Level |
| 101111 | custom | 0 | custom | SYSTEM | Machine-Level |
| 111111 | custom | 0 | custom | SYSTEM | Machine-Level |
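
A decoder can recognise this custom subspace and read the recommended privilege level from bits 29:28. The sketch below is illustrative only: the function name is hypothetical, and it assumes the SYSTEM major opcode value 1110011 and that funct6 occupies bits 31:26.

```python
SYSTEM_OPCODE = 0b1110011
PRIVILEGE = {
    0b00: "Unprivileged or User-Level",
    0b01: "Supervisor-Level",
    0b10: "Hypervisor-Level",
    0b11: "Machine-Level",
}


def custom_system_privilege(insn: int):
    """Return the recommended privilege for a custom SYSTEM encoding, else None."""
    opcode = insn & 0x7F
    funct3 = (insn >> 12) & 0x7
    funct6 = (insn >> 26) & 0x3F
    if opcode != SYSTEM_OPCODE or funct3 != 0:
        return None
    # All eight listed funct6 values have bit 31 set and bits 27:26 equal to 11.
    if (funct6 & 0b100011) != 0b100011:
        return None
    return PRIVILEGE[(insn >> 28) & 0x3]


print(custom_system_privilege((0b100111 << 26) | SYSTEM_OPCODE))
```
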


Reset

Upon reset, a hart’s privilege mode is set to M. The mstatus fields
MIE and MPRV are reset to 0. If little-endian memory accesses are
supported, the mstatus/mstatush field MBE is reset to 0. The misa
register is reset to enable the maximal set of supported extensions and
widest MXLEN, as described in
Section 1.1.1. For implementations with the “A”
standard extension, there is no valid load reservation. The pc is set
to an implementation-defined reset vector. The mcause register is set
to a value indicating the cause of the reset. Writable PMP registers’ A
and L fields are set to 0, unless the platform mandates a different
reset value for some PMP registers’ A and L fields. If the hypervisor
extension is implemented, the hgatp.MODE and vsatp.MODE fields are
reset to 0. If the Smrnmi extension is implemented, the mnstatus.NMIE
field is reset to 0. No WARL field contains an illegal value. All other hart
state is UNSPECIFIED.
The mcause values after reset have implementation-specific
interpretation, but the value 0 should be returned on implementations
that do not distinguish different reset conditions. Implementations that
distinguish different reset conditions should only use 0 to indicate the
most complete reset.

Some designs may have multiple causes of reset (e.g., power-on reset,
external hard reset, brownout detected, watchdog timer elapse,
sleep-mode wakeup), which machine-mode software and debuggers may wish
to distinguish.
mcause reset values may alias mcause values following synchronous
exceptions. There should be no ambiguity in this overlap, since on reset
the pc is typically set to a different value than on other traps.

Non-Maskable Interrupts

Non-maskable interrupts (NMIs) are only used for hardware error
conditions, and cause an immediate jump to an implementation-defined NMI
vector running in M-mode regardless of the state of a hart’s interrupt
enable bits. The mepc register is written with the virtual address of
the instruction that was interrupted, and mcause is set to a value
indicating the source of the NMI. The NMI can thus overwrite state in an
active machine-mode interrupt handler.
The values written to mcause on an NMI are implementation-defined. The
high Interrupt bit of mcause should be set to indicate that this was
an interrupt. An Exception Code of 0 is reserved to mean “unknown cause”
and implementations that do not distinguish sources of NMIs via the
mcause register should return 0 in the Exception Code.
Unlike resets, NMIs do not reset processor state, enabling diagnosis,
reporting, and possible containment of the hardware error.

Physical Memory Attributes

The physical memory map for a complete system includes various address
ranges, some corresponding to memory regions, some to memory-mapped
control registers, and some to vacant holes in the address space. Some
memory regions might not support reads, writes, or execution; some might
not support subword or subblock accesses; some might not support atomic
operations; and some might not support cache coherence or might have
different memory models. Similarly, memory-mapped control registers vary
in their supported access widths, support for atomic operations, and
whether read and write accesses have associated side effects. In RISC-V
systems, these properties and capabilities of each region of the
machine’s physical address space are termed physical memory attributes
(PMAs). This section describes RISC-V PMA terminology and how RISC-V
systems implement and check PMAs.
PMAs are inherent properties of the underlying hardware and rarely
change during system operation. Unlike physical memory protection values
described in Section 1.7, PMAs do not vary by execution context.
The PMAs of some memory regions are fixed at chip design time—for
example, for an on-chip ROM. Others are fixed at board design time,
depending, for example, on which other chips are connected to off-chip
buses. Off-chip buses might also support devices that could be changed
on every power cycle (cold pluggable) or dynamically while the system is
running (hot pluggable). Some devices might be configurable at run time
to support different uses that imply different PMAs—for example, an
on-chip scratchpad RAM might be cached privately by one core in one
end-application, or accessed as a shared non-cached memory in another
end-application.
Most systems will require that at least some PMAs are dynamically
checked in hardware later in the execution pipeline after the physical
address is known, as some operations will not be supported at all
physical memory addresses, and some operations require knowing the
current setting of a configurable PMA attribute. While many other
architectures specify some PMAs in the virtual memory page tables and
use the TLB to inform the pipeline of these properties, this approach
injects platform-specific information into a virtualized layer and can
cause system errors unless attributes are correctly initialized in each
page-table entry for each physical memory region. In addition, the
available page sizes might not be optimal for specifying attributes in
the physical memory space, leading to address-space fragmentation and
inefficient use of expensive TLB entries.
For RISC-V, we separate out specification and checking of PMAs into a
separate hardware structure, the PMA checker. In many cases, the
attributes are known at system design time for each physical address
region, and can be hardwired into the PMA checker. Where the attributes
are run-time configurable, platform-specific memory-mapped control
registers can be provided to specify these attributes at a granularity
appropriate to each region on the platform (e.g., for an on-chip SRAM
that can be flexibly divided between cacheable and uncacheable uses).
PMAs are checked for any access to physical memory, including accesses
that have undergone virtual to physical memory translation. To aid in
system debugging, we strongly recommend that, where possible, RISC-V
processors precisely trap physical memory accesses that fail PMA checks.
Precisely trapped PMA violations manifest as instruction, load, or store
access-fault exceptions, distinct from virtual-memory page-fault
exceptions. Precise PMA traps might not always be possible, for example,
when probing a legacy bus architecture that uses access failures as part
of the discovery mechanism. In this case, error responses from
peripheral devices will be reported as imprecise bus-error interrupts.
PMAs must also be readable by software to correctly access certain
devices or to correctly configure other hardware components that access
memory, such as DMA engines. As PMAs are tightly tied to a given
physical platform’s organization, many details are inherently
platform-specific, as is the means by which software can learn the PMA
values for a platform. Some devices, particularly legacy buses, do not
support discovery of PMAs and so will give error responses or time out
if an unsupported access is attempted. Typically, platform-specific
machine-mode code will extract PMAs and ultimately present this
information to higher-level less-privileged software using some standard
representation.
Where platforms support dynamic reconfiguration of PMAs, an interface
will be provided to set the attributes by passing requests to a
machine-mode driver that can correctly reconfigure the platform. For
example, switching cacheability attributes on some memory regions might
involve platform-specific operations, such as cache flushes, that are
available only to machine-mode.

Main Memory versus I/O versus Vacant Regions

The most important characterization of a given memory address range is
whether it holds regular main memory, or I/O devices, or is vacant.
Regular main memory is required to have a number of properties,
specified below, whereas I/O devices can have a much broader range of
attributes. Memory regions that do not fit into regular main memory, for
example, device scratchpad RAMs, are categorized as I/O regions. Vacant
regions are also classified as I/O regions but with attributes
specifying that no accesses are supported.
Supported Access Type PMAs

Access types specify which access widths, from 8-bit byte to long
multi-word burst, are supported, and also whether misaligned accesses
are supported for each access width.

Although software running on a RISC-V hart cannot directly generate
bursts to memory, software might have to program DMA engines to access
I/O devices and might therefore need to know which access sizes are
supported.

Main memory regions always support read and write of all access widths
required by the attached devices, and can specify whether instruction
fetch is supported.

Some platforms might mandate that all of main memory support instruction
fetch. Other platforms might prohibit instruction fetch from some main
memory regions.


In some cases, the design of a processor or device accessing main memory
might support other widths, but must be able to function with the types
supported by the main memory.

I/O regions can specify which combinations of read, write, or execute
accesses to which data widths are supported.
For systems with page-based virtual memory, I/O and memory regions can
specify which combinations of hardware page-table reads and hardware
page-table writes are supported.

Unix-like operating systems generally require that all of cacheable main
memory supports page-table walks.

Atomicity PMAs

Atomicity PMAs describe which atomic instructions are supported in this
address region. Support for atomic instructions is divided into two
categories: LR/SC and AMOs.

Some platforms might mandate that all of cacheable main memory support
all atomic operations required by the attached processors.

AMO PMA

Within AMOs, there are four levels of support: AMONone, AMOSwap,
AMOLogical, and AMOArithmetic. AMONone indicates that no AMO
operations are supported. AMOSwap indicates that only amoswap
instructions are supported in this address range. AMOLogical indicates
that swap instructions plus all the logical AMOs (amoand, amoor,
amoxor) are supported. AMOArithmetic indicates that all RISC-V AMOs
are supported. For each level of support, naturally aligned AMOs of a
given width are supported if the underlying memory region supports reads
and writes of that width. Main memory and I/O regions may only support a
subset or none of the processor-supported atomic operations.


| AMO Class | Supported Operations |
|:- |:- |
| AMONone | None |
| AMOSwap | amoswap |
| AMOLogical | above + amoand, amoor, amoxor |
| AMOArithmetic | above + amoadd, amomin, amomax, amominu, amomaxu |


We recommend providing at least AMOLogical support for I/O regions where
possible.
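
As an informal illustration of the AMO levels above (not part of the specification), the following Python sketch checks whether a naturally aligned AMO is supported by a region. The names `AmoClass`, `REQUIRED_LEVEL`, and `amo_supported` are hypothetical, and the set of supported read/write widths is assumed to be supplied by the platform.

```python
from enum import IntEnum


class AmoClass(IntEnum):
    """AMO PMA levels, ordered so each level includes the ones below it."""
    AMO_NONE = 0
    AMO_SWAP = 1
    AMO_LOGICAL = 2
    AMO_ARITHMETIC = 3


# Minimum PMA level needed by each AMO mnemonic (width suffix omitted).
REQUIRED_LEVEL = {
    "amoswap": AmoClass.AMO_SWAP,
    "amoand": AmoClass.AMO_LOGICAL,
    "amoor": AmoClass.AMO_LOGICAL,
    "amoxor": AmoClass.AMO_LOGICAL,
    "amoadd": AmoClass.AMO_ARITHMETIC,
    "amomin": AmoClass.AMO_ARITHMETIC,
    "amomax": AmoClass.AMO_ARITHMETIC,
    "amominu": AmoClass.AMO_ARITHMETIC,
    "amomaxu": AmoClass.AMO_ARITHMETIC,
}


def amo_supported(region_class, mnemonic, width_bytes, rw_widths):
    """Return True if a naturally aligned AMO of this width is supported.

    rw_widths is the (assumed, platform-provided) set of widths in bytes for
    which the region supports both reads and writes.
    """
    return region_class >= REQUIRED_LEVEL[mnemonic] and width_bytes in rw_widths
```

For example, a region marked AMOLogical that supports 4- and 8-byte reads and writes accepts a 4-byte `amoor` but not a 4-byte `amoadd`.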

Reservability PMA

For LR/SC, there are three levels of support indicating combinations
of the reservability and eventuality properties: RsrvNone,
RsrvNonEventual, and RsrvEventual. RsrvNone indicates that no LR/SC
operations are supported (the location is non-reservable).
RsrvNonEventual indicates that the operations are supported (the
location is reservable), but without the eventual success guarantee
described in the unprivileged ISA specification. RsrvEventual indicates
that the operations are supported and provide the eventual success
guarantee.

We recommend providing RsrvEventual support for main memory regions
where possible. Most I/O regions will not support LR/SC accesses, as
these are most conveniently built on top of a cache-coherence scheme,
but some may support RsrvNonEventual or RsrvEventual.


When LR/SC is used for memory locations marked RsrvNonEventual, software
should provide alternative fall-back mechanisms used when lack of
progress is detected.

Alignment

Memory regions that support aligned LR/SC or aligned AMOs might also
support misaligned LR/SC or misaligned AMOs for some addresses and
access widths. If, for a given address and access width, a misaligned
LR/SC or AMO generates an address-misaligned exception, then all
loads, stores, LRs/SCs, and AMOs using that address and access width
must generate address-misaligned exceptions.

The standard “A” extension does not support misaligned AMOs or LR/SC
pairs. Support for misaligned AMOs is provided by the standard “Zam”
extension. Support for misaligned LR/SC sequences is not currently
standardized, so LR and SC to misaligned addresses must raise an
exception.
Mandating that misaligned loads and stores raise address-misaligned
exceptions wherever misaligned AMOs raise address-misaligned exceptions
permits the emulation of misaligned AMOs in an M-mode trap handler. The
handler guarantees atomicity by acquiring a global mutex and emulating
the access within the critical section. Provided that the handler for
misaligned loads and stores uses the same mutex, all accesses to a given
address that use the same word size will be mutually atomic.

Implementations may raise access-fault exceptions instead of
address-misaligned exceptions for some misaligned accesses, indicating
the instruction should not be emulated by a trap handler. If, for a
given address and access width, all misaligned LRs/SCs and AMOs generate
access-fault exceptions, then regular misaligned loads and stores using
the same address and access width are not required to execute
atomically.
Memory-Ordering PMAs

Regions of the address space are classified as either main memory or
I/O for the purposes of ordering by the FENCE instruction and
atomic-instruction ordering bits.
Accesses by one hart to main memory regions are observable not only by
other harts but also by other devices with the capability to initiate
requests in the main memory system (e.g., DMA engines). Coherent main
memory regions always have either the RVWMO or RVTSO memory model.
Incoherent main memory regions have an implementation-defined memory
model.
Accesses by one hart to an I/O region are observable not only by other
harts and bus mastering devices but also by the targeted I/O devices,
and I/O regions may be accessed with either relaxed or strong
ordering. Accesses to an I/O region with relaxed ordering are generally
observed by other harts and bus mastering devices in a manner similar to
the ordering of accesses to an RVWMO memory region, as discussed in
Section A.4.2 in Volume I of this specification. By contrast, accesses
to an I/O region with strong ordering are generally observed by other
harts and bus mastering devices in program order.
Each strongly ordered I/O region specifies a numbered ordering channel,
which is a mechanism by which ordering guarantees can be provided
between different I/O regions. Channel 0 is used to indicate
point-to-point strong ordering only, where only accesses by the hart to
the single associated I/O region are strongly ordered.
Channel 1 is used to provide global strong ordering across all I/O
regions. Any accesses by a hart to any I/O region associated with
channel 1 can only be observed to have occurred in program order by all
other harts and I/O devices, including relative to accesses made by that
hart to relaxed I/O regions or strongly ordered I/O regions with
different channel numbers. In other words, any access to a region in
channel 1 is equivalent to executing a fence io,io instruction before
and after the instruction.
Other larger channel numbers provide program ordering to accesses by
that hart across any regions with the same channel number.
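
The channel rules above can be summarized in a short Python sketch. This is an informal reading of the text, not a normative definition: `IORegion` and `ordered_by_pma` are hypothetical names, and representing relaxed ordering as `channel=None` is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IORegion:
    name: str
    channel: Optional[int]  # None: relaxed ordering; 0, 1, 2, ...: strong ordering


def ordered_by_pma(a: IORegion, b: IORegion) -> bool:
    """Return True if one hart's accesses to regions a and b are guaranteed
    to be observed in program order by the ordering PMAs alone."""
    # Channel 1: global strong ordering, even relative to relaxed regions
    # and regions on other channels.
    if a.channel == 1 or b.channel == 1:
        return True
    # Relaxed regions get no program-order guarantee from the PMAs.
    if a.channel is None or b.channel is None:
        return False
    # Channel 0: point-to-point ordering within a single region only.
    if a.channel == 0 or b.channel == 0:
        return a == b
    # Larger channel numbers: ordered across regions sharing the channel.
    return a.channel == b.channel
```

When this function returns False, software would need explicit FENCE instructions to obtain the ordering.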
Systems might support dynamic configuration of ordering properties on
each memory region.

Strong ordering can be used to improve compatibility with legacy device
driver code, or to enable increased performance compared to insertion of
explicit ordering instructions when the implementation is known to not
reorder accesses.
Local strong ordering (channel 0) is the default form of strong ordering
as it is often straightforward to provide if there is only a single
in-order communication path between the hart and the I/O device.
Generally, different strongly ordered I/O regions can share the same
ordering channel without additional ordering hardware if they share the
same interconnect path and the path does not reorder requests.

Coherence and Cacheability PMAs

Coherence is a property defined for a single physical address, and
indicates that writes to that address by one agent will eventually be
made visible to other agents in the system. Coherence is not to be
confused with the memory consistency model of a system, which defines
what values a memory read can return given the previous history of reads
and writes to the entire memory system. In RISC-V platforms, the use of
hardware-incoherent regions is discouraged due to software complexity,
performance, and energy impacts.
The cacheability of a memory region should not affect the software view
of the region except for differences reflected in other PMAs, such as
main memory versus I/O classification, memory ordering, supported
accesses and atomic operations, and coherence. For this reason, we treat
cacheability as a platform-level setting managed by machine-mode
software only.
Where a platform supports configurable cacheability settings for a
memory region, a platform-specific machine-mode routine will change the
settings and flush caches if necessary, so the system is only incoherent
during the transition between cacheability settings. This transitory
state should not be visible to lower privilege levels.

Coherence is straightforward to provide for a shared memory region that
is not cached by any agent. The PMA for such a region would simply
indicate it should not be cached in a private or shared cache.
Coherence is also straightforward for read-only regions, which can be
safely cached by multiple agents without requiring a cache-coherence
scheme. The PMA for this region would indicate that it can be cached,
but that writes are not supported.
Some read-write regions might only be accessed by a single agent, in
which case they can be cached privately by that agent without requiring
a coherence scheme. The PMA for such regions would indicate they can be
cached. The data can also be cached in a shared cache, as other agents
should not access the region.
If an agent can cache a read-write region that is accessible by other
agents, whether caching or non-caching, a cache-coherence scheme is
required to avoid use of stale values. In regions lacking hardware cache
coherence (hardware-incoherent regions), cache coherence can be
implemented entirely in software, but software coherence schemes are
notoriously difficult to implement correctly and often have severe
performance impacts due to the need for conservative software-directed
cache-flushing. Hardware cache-coherence schemes require more complex
hardware and can impact performance due to the cache-coherence probes,
but are otherwise invisible to software.
For each hardware cache-coherent region, the PMA would indicate that the
region is coherent and which hardware coherence controller to use if the
system has multiple coherence controllers. For some systems, the
coherence controller might be an outer-level shared cache, which might
itself access further outer-level cache-coherence controllers
hierarchically.
Most memory regions within a platform will be coherent to software,
because they will be fixed as either uncached, read-only, hardware
cache-coherent, or only accessed by one agent.

If a PMA indicates non-cacheability, then accesses to that region must
be satisfied by the memory itself, not by any caches.

For implementations with a cacheability-control mechanism, the situation
may arise that a program uncacheably accesses a memory location that is
currently cache-resident. In this situation, the cached copy must be
ignored. This constraint is necessary to prevent more-privileged modes’
speculative cache refills from affecting the behavior of less-privileged
modes’ uncacheable accesses.

Idempotency PMAs

Idempotency PMAs describe whether reads and writes to an address region
are idempotent. Main memory regions are assumed to be idempotent. For
I/O regions, idempotency on reads and writes can be specified separately
(e.g., reads are idempotent but writes are not). If accesses are
non-idempotent, i.e., there is potentially a side effect on any read or
write access, then speculative or redundant accesses must be avoided.
For the purposes of defining the idempotency PMAs, changes in observed
memory ordering created by redundant accesses are not considered a side
effect.

While hardware should always be designed to avoid speculative or
redundant accesses to memory regions marked as non-idempotent, it is
also necessary to ensure software or compiler optimizations do not
generate spurious accesses to non-idempotent memory regions.


Non-idempotent regions might not support misaligned accesses. Misaligned
accesses to such regions should raise access-fault exceptions rather
than address-misaligned exceptions, indicating that software should not
emulate the misaligned access using multiple smaller accesses, which
could cause unexpected side effects.

For non-idempotent regions, implicit reads and writes must not be
performed early or speculatively, with the following exceptions. When a
non-speculative implicit read is performed, an implementation is
permitted to additionally read any of the bytes within a naturally
aligned power-of-2 region containing the address of the non-speculative
implicit read. Furthermore, when a non-speculative instruction fetch is
performed, an implementation is permitted to additionally read any of
the bytes within the next naturally aligned power-of-2 region of the
same size (with the address of the region taken modulo
$2^{\text{XLEN}}$). The results of these additional reads may be used to
satisfy subsequent early or speculative implicit reads. The size of
these naturally aligned power-of-2 regions is implementation-defined,
but, for systems with page-based virtual memory, must not exceed the
smallest supported page size.
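
As a small worked example of the rule above, the following Python sketch computes which naturally aligned regions an implementation may additionally read. The function name and parameters are hypothetical; the region size is implementation-defined and is passed in as an assumption.

```python
def implicit_read_windows(addr, region_size, xlen, is_fetch):
    """Return base addresses of the naturally aligned power-of-2 regions
    that may be read alongside a non-speculative implicit read at addr."""
    if region_size <= 0 or region_size & (region_size - 1):
        raise ValueError("region_size must be a power of 2")
    base = addr & ~(region_size - 1)       # region containing addr
    windows = [base]
    if is_fetch:
        # Instruction fetch may also read the next region, with the
        # address wrapping modulo 2**xlen.
        windows.append((base + region_size) % (1 << xlen))
    return windows
```

With a 64-byte region size on RV32, a fetch at 0xFFFFFFF0 permits reads from the regions at 0xFFFFFFC0 and, after wrapping, at 0x0.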
Physical Memory Protection

To support secure processing and contain faults, it is desirable to
limit the physical addresses accessible by software running on a hart.
An optional physical memory protection (PMP) unit provides per-hart
machine-mode control registers to allow physical memory access
privileges (read, write, execute) to be specified for each physical
memory region. The PMP values are checked in parallel with the PMA
checks described in Section 1.6.
The granularity of PMP access control settings is platform-specific,
but the standard PMP encoding supports regions as small as four bytes.
Certain regions’ privileges can be hardwired—for example, some regions
might only ever be visible in machine mode but in no lower-privilege
layers.

Platforms vary widely in demands for physical memory protection, and
some platforms may provide other PMP structures in addition to or
instead of the scheme described in this section.

PMP checks are applied to all accesses whose effective privilege mode is
S or U, including instruction fetches and data accesses in S and U mode,
and data accesses in M-mode when the MPRV bit in mstatus is set and
the MPP field in mstatus contains S or U. PMP checks are also applied
to page-table accesses for virtual-address translation, for which the
effective privilege mode is S. Optionally, PMP checks may additionally
apply to M-mode accesses, in which case the PMP registers themselves are
locked, so that even M-mode software cannot change them until the hart
is reset. In effect, PMP can grant permissions to S and U modes, which
by default have none, and can revoke permissions from M-mode, which by
default has full permissions.
PMP violations are always trapped precisely at the processor.
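
The effective privilege mode used for PMP checks can be sketched as follows. This is an illustrative Python model only; the function names are hypothetical, and the mode constants use the standard privilege encodings (U=0, S=1, M=3).

```python
U, S, M = 0, 1, 3  # privilege-mode encodings, as used in mstatus.MPP


def effective_privilege(current_mode, is_fetch, mprv, mpp,
                        is_page_table_access=False):
    """Return the privilege mode against which PMP checks the access."""
    if is_page_table_access:
        return S                      # implicit page-table accesses act as S
    if current_mode == M and mprv and not is_fetch:
        return mpp                    # MPRV redirects M-mode data accesses only
    return current_mode


def pmp_always_checked(effective_mode):
    """S and U accesses are always checked; M-mode accesses are checked
    only against locked entries."""
    return effective_mode in (S, U)
```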
Physical Memory Protection CSRs

PMP entries are described by an 8-bit configuration register and one
MXLEN-bit address register. Some PMP settings additionally use the
address register associated with the preceding PMP entry. Up to 64 PMP
entries are supported. Implementations may implement zero, 16, or 64 PMP
entries; the lowest-numbered PMP entries must be implemented first. All
PMP CSR fields are WARL and may be read-only zero. PMP CSRs are only
accessible to M-mode.
The PMP configuration registers are densely packed into CSRs to minimize
context-switch time. For RV32, sixteen CSRs, pmpcfg0–pmpcfg15, hold
the configurations pmp0cfg–pmp63cfg for the 64 PMP entries, as shown
in Figure [pmpcfg-rv32]. For RV64, eight
even-numbered CSRs, pmpcfg0, pmpcfg2, …, pmpcfg14, hold the
configurations for the 64 PMP entries, as shown in
Figure [pmpcfg-rv64]. For RV64, the
odd-numbered configuration registers, pmpcfg1, pmpcfg3, …,
pmpcfg15, are illegal.

RV64 systems use pmpcfg2, rather than pmpcfg1, to hold
configurations for PMP entries 8–15. This design reduces the cost of
supporting multiple MXLEN values, since the configurations for PMP
entries 8–11 appear in pmpcfg2[31:0] for both RV32 and RV64.
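
The packing described above can be expressed as a small helper that locates a given entry's configuration byte. This is a sketch in Python; `pmpcfg_location` is a hypothetical name.

```python
def pmpcfg_location(entry, mxlen):
    """Return (n, low_bit) such that pmp<entry>cfg occupies
    pmpcfg<n>[low_bit + 7 : low_bit]."""
    if not 0 <= entry < 64:
        raise ValueError("PMP entry must be in 0..63")
    if mxlen == 32:
        return entry // 4, 8 * (entry % 4)        # four entries per CSR
    if mxlen == 64:
        return 2 * (entry // 8), 8 * (entry % 8)  # eight per even-numbered CSR
    raise ValueError("mxlen must be 32 or 64")
```

For instance, `pmpcfg_location(4, 64)` gives `(0, 32)` and `pmpcfg_location(4, 32)` gives `(1, 0)`, while entry 8 lands at `pmpcfg2[7:0]` for both widths.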


[Figure: RV32 PMP configuration CSR layout. Each of pmpcfg0–pmpcfg15 packs four 8-bit PMP entry configuration fields.]

[Figure: RV64 PMP configuration CSR layout. Each of pmpcfg0, pmpcfg2, …, pmpcfg14 packs eight 8-bit PMP entry configuration fields.]


The PMP address registers are CSRs named pmpaddr0–pmpaddr63. Each
PMP address register encodes bits 33–2 of a 34-bit physical address for
RV32, as shown in
Figure [pmpaddr-rv32]. For RV64, each PMP
address register encodes bits 55–2 of a 56-bit physical address, as
shown in Figure [pmpaddr-rv64]. Not all physical
address bits may be implemented, and so the pmpaddr registers are WARL.

The Sv32 page-based virtual-memory scheme described in
Section [sec:sv32] supports 34-bit physical
addresses for RV32, so the PMP scheme must support addresses wider than
XLEN for RV32. The Sv39 and Sv48 page-based virtual-memory schemes
described in Sections [sec:sv39]
and [sec:sv48] support a 56-bit physical
address space, so the RV64 PMP address registers impose the same limit.


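[Figure: PMP address register format, RV32. All 32 bits hold address[33:2] (WARL).]

[Figure: PMP address register format, RV64. The upper 10 bits are zero (WARL); the lower 54 bits hold address[55:2] (WARL).]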


Figure [pmpcfg] shows the layout of a PMP
configuration register. The R, W, and X bits, when set, indicate that
the PMP entry permits read, write, and instruction execution,
respectively. When one of these bits is clear, the corresponding access
type is denied. The R, W, and X fields form a collective WARL field for
which the combinations with R=0 and W=1 are reserved. The remaining two
fields, A and L, are described in the following sections.


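| Bits | 7 | 6:5 | 4:3 | 2 | 1 | 0 |
|:- |:- |:- |:- |:- |:- |:- |
| Field | L (WARL) | 0 (WARL) | A (WARL) | X (WARL) | W (WARL) | R (WARL) |
| Width | 1 | 2 | 2 | 1 | 1 | 1 |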


Attempting to fetch an instruction from a PMP region that does not have
execute permissions raises an instruction access-fault exception.
Attempting to execute a load or load-reserved instruction which accesses
a physical address within a PMP region without read permissions raises a
load access-fault exception. Attempting to execute a store,
store-conditional, or AMO instruction which accesses a physical address
within a PMP region without write permissions raises a store
access-fault exception.
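
A minimal Python sketch of decoding a configuration byte and selecting the resulting exception follows. It assumes the standard field positions (R in bit 0, W in bit 1, X in bit 2, A in bits 4:3, L in bit 7); `PmpCfg`, `decode_pmpcfg`, and `permission_fault` are hypothetical names, and address matching is left out.

```python
from typing import NamedTuple


class PmpCfg(NamedTuple):
    r: bool
    w: bool
    x: bool
    a: int   # address-matching mode, 0..3
    l: bool


def decode_pmpcfg(byte):
    """Split an 8-bit PMP configuration value into its fields
    (bit positions are an assumption of this sketch)."""
    return PmpCfg(r=bool(byte & 0x01), w=bool(byte & 0x02),
                  x=bool(byte & 0x04), a=(byte >> 3) & 0x3,
                  l=bool(byte & 0x80))


def is_reserved(cfg):
    """Combinations with R=0 and W=1 are reserved."""
    return cfg.w and not cfg.r


def permission_fault(cfg, access):
    """Return the exception raised when the matching entry denies the
    access, or None if permitted. access is 'fetch', 'load' (including
    load-reserved) or 'store' (including store-conditional and AMOs)."""
    allowed = {"fetch": cfg.x, "load": cfg.r, "store": cfg.w}[access]
    names = {"fetch": "instruction access fault",
             "load": "load access fault",
             "store": "store access fault"}
    return None if allowed else names[access]
```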
If MXLEN is changed, the contents of the pmpxcfg fields are preserved,
but appear in the pmpcfgy CSR prescribed by the new setting of MXLEN.
For example, when MXLEN is changed from 64 to 32, pmp4cfg moves from
pmpcfg0[39:32] to pmpcfg1[7:0]. The pmpaddr CSRs follow the
usual CSR width modulation rules described in
Section [sec:csrwidthmodulation].
Address Matching

The A field in a PMP entry’s configuration register encodes the
address-matching mode of the associated PMP address register. The
encoding of this field is shown in
Table [pmpcfg-a]. When A=0, this PMP entry is
disabled and matches no addresses. Two other address-matching modes are
supported: naturally aligned power-of-2 regions (NAPOT), including the
special case of naturally aligned four-byte regions (NA4); and the top
boundary of an arbitrary range (TOR). These modes support four-byte
granularity.


| A | Name | Description |
|:- |:- |:- |
| 0 | OFF | Null region (disabled) |
| 1 | TOR | Top of range |
| 2 | NA4 | Naturally aligned four-byte region |
| 3 | NAPOT | Naturally aligned power-of-two region, ≥8 bytes |


NAPOT ranges make use of the low-order bits of the associated address
register to encode the size of the range, as shown in
Table [pmpcfg-napot].


| pmpaddr | pmpcfg.A | Match type and size |
|:- |:- |:- |
| yyyy...yyyy | NA4 | 4-byte NAPOT range |
| yyyy...yyy0 | NAPOT | 8-byte NAPOT range |
| yyyy...yy01 | NAPOT | 16-byte NAPOT range |
| yyyy...y011 | NAPOT | 32-byte NAPOT range |
| … | … | … |
| yy01...1111 | NAPOT | $2^{\text{XLEN}}$-byte NAPOT range |
| y011...1111 | NAPOT | $2^{\text{XLEN}+1}$-byte NAPOT range |
| 0111...1111 | NAPOT | $2^{\text{XLEN}+2}$-byte NAPOT range |
| 1111...1111 | NAPOT | $2^{\text{XLEN}+3}$-byte NAPOT range |
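
The NAPOT encoding can be illustrated with a pair of Python helpers. This is a sketch under the encoding shown in the table: a register ending in a zero followed by t trailing ones describes a region of $2^{t+3}$ bytes. The function names are hypothetical.

```python
def decode_napot(pmpaddr, a_field):
    """Return (base, size) in bytes for an NA4 (A=2) or NAPOT (A=3) entry."""
    if a_field == 2:                  # NA4: the register is address[..:2]
        return pmpaddr << 2, 4
    if a_field != 3:
        raise ValueError("not an NA4/NAPOT entry")
    t = 0
    while (pmpaddr >> t) & 1:         # count trailing one bits
        t += 1
    size = 1 << (t + 3)
    base = ((pmpaddr >> t) << t) << 2  # clear the size bits, scale to bytes
    return base, size


def encode_napot(base, size):
    """Inverse of decode_napot for NAPOT regions of 8 bytes or more."""
    if size < 8 or size & (size - 1) or base % size:
        raise ValueError("need a naturally aligned power-of-2 region >= 8 bytes")
    return (base >> 2) | ((size >> 3) - 1)
```

For example, a 4 KiB region at 0x20000000 is encoded as 0x080001FF: the address shifted right by two, with nine trailing ones.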


If TOR is selected, the associated address register forms the top of the
address range, and the preceding PMP address register forms the bottom
of the address range. If PMP entry i’s A field is set to TOR, the
entry matches any address y such that
${\tt pmpaddr}_{i-1}\leq y < {\tt pmpaddr}_i$ (irrespective of the
value of ${\tt pmpcfg}_{i-1}$). If PMP entry 0’s A field is set to
TOR, zero is used for the lower bound, and so it matches any address
$y < {\tt pmpaddr}_0$.

If ${\tt pmpaddr}_{i-1}\geq {\tt pmpaddr}_i$ and
${\tt pmpcfg_i.A}$=TOR, then PMP entry i matches no addresses.
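
The TOR rule can be sketched in Python as follows. Because each pmpaddr register holds the physical address shifted right by two bits, this sketch shifts the register values left by two before comparing them with a byte address; that scaling, and the name `tor_matches`, are assumptions of the illustration.

```python
def tor_matches(pmpaddr, i, byte_addr):
    """Return True if TOR entry i matches byte_addr.

    pmpaddr is a list of raw pmpaddr register values (address[..:2]).
    """
    lower = 0 if i == 0 else pmpaddr[i - 1] << 2
    upper = pmpaddr[i] << 2
    # An empty or inverted range (lower >= upper) matches nothing.
    return lower <= byte_addr < upper
```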

Although the PMP mechanism supports regions as small as four bytes,
platforms may specify coarser PMP regions. In general, the PMP grain is
$2^{G+2}$ bytes and must be the same across all PMP regions.
When G ≥ 1, the NA4 mode is not selectable. When G ≥ 2 and
${\tt pmpcfg}_i$.A[1] is set, i.e. the mode is NAPOT, then bits
${\tt pmpaddr}_i$[G-2:0] read as all ones. When G ≥ 1 and
${\tt pmpcfg}_i$.A[1] is clear, i.e. the mode is OFF or TOR, then
bits ${\tt pmpaddr}_i$[G-1:0] read as all zeros. Bits
${\tt pmpaddr}_i$[G-1:0] do not affect the TOR address-matching logic.
Although changing ${\tt pmpcfg}_i$.A[1] affects the value read from
${\tt pmpaddr}_i$, it does not affect the underlying value stored in
that register—in particular, ${\tt pmpaddr}_i$[G-1] retains its
original value when ${\tt pmpcfg}_i$.A is changed from NAPOT to
TOR/OFF then back to NAPOT.

Software may determine the PMP granularity by writing zero to pmp0cfg,
then writing all ones to pmpaddr0, then reading back pmpaddr0. If
G is the index of the least-significant bit set, the PMP granularity
is $2^{G+2}$ bytes.
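
The final step of this probe is simple arithmetic on the value read back. A Python sketch follows (the CSR accesses themselves are outside its scope, and `pmp_granularity` is a hypothetical name):

```python
def pmp_granularity(readback):
    """Given pmpaddr0 read back after writing all ones with pmp0cfg = 0,
    return (G, granularity_in_bytes)."""
    if readback == 0:
        raise ValueError("no writable pmpaddr0 bits; PMP may be unimplemented")
    g = (readback & -readback).bit_length() - 1  # index of lowest set bit
    return g, 1 << (g + 2)
```

A readback of 0xFFFFFC00, for example, gives G = 10 and a granularity of 4096 bytes.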

If the current XLEN is greater than MXLEN, the PMP address registers are
zero-extended from MXLEN to XLEN bits for the purposes of address
matching.
Locking and Privilege Mode

The L bit indicates that the PMP entry is locked, i.e., writes to the
configuration register and associated address registers are ignored.
Locked PMP entries remain locked until the hart is reset. If PMP entry
i is locked, writes to pmp$i$cfg and ${\tt pmpaddr}_i$ are ignored.
Additionally, if PMP entry i is locked and pmp$i$cfg.A is set to
TOR, writes to ${\tt pmpaddr}_{i-1}$ are ignored.

Setting the L bit locks the PMP entry even when the A field is set to
OFF.
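
The write-ignore rules for address registers can be sketched in Python. The helper below is illustrative only; it assumes the standard configuration-byte layout (L in bit 7, A in bits 4:3, with TOR encoded as A=1), and `pmpaddr_write_ignored` is a hypothetical name.

```python
def pmpaddr_write_ignored(i, cfg_bytes):
    """Return True if a write to pmpaddr<i> is ignored.

    cfg_bytes[n] is the 8-bit configuration of PMP entry n.
    """
    def locked(n):
        return bool(cfg_bytes[n] & 0x80)

    def is_tor(n):
        return (cfg_bytes[n] >> 3) & 0x3 == 1

    if locked(i):                 # locked entries ignore writes, even if OFF
        return True
    nxt = i + 1                   # a locked TOR entry also protects its lower bound
    return nxt < len(cfg_bytes) and locked(nxt) and is_tor(nxt)
```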

In addition to locking the PMP entry, the L bit indicates whether the
R/W/X permissions are enforced on M-mode accesses. When the L bit is
set, these permissions are enforced for all privilege modes. When the L
bit is clear, any M-mode access matching the PMP entry will succeed; the
R/W/X permissions apply only to S and U modes.
Priority and Matching Logic

PMP entries are statically prioritized. The lowest-numbered PMP entry
that matches any byte of an access determines whether that access
succeeds or fails. The matching PMP entry must match all bytes of an
access, or the access fails, irrespective of the L, R, W, and X bits.
For example, if a PMP entry is configured to match the four-byte range
0xC–0xF, then an 8-byte access to the range 0x8–0xF will fail,
assuming that PMP entry is the highest-priority entry that matches those
addresses.
If a PMP entry matches all bytes of an access, then the L, R, W, and X
bits determine whether the access succeeds or fails. If the L bit is
clear and the privilege mode of the access is M, the access succeeds.
Otherwise, if the L bit is set or the privilege mode of the access is S
or U, then the access succeeds only if the R, W, or X bit corresponding
to the access type is set.
If no PMP entry matches an M-mode access, the access succeeds. If no PMP
entry matches an S-mode or U-mode access, but at least one PMP entry is
implemented, the access fails.

If at least one PMP entry is implemented, but all PMP entries’ A fields
are set to OFF, then all S-mode and U-mode memory accesses will fail.
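
The priority and matching rules above can be collected into one Python sketch. It is a simplified model rather than a reference implementation: `Entry` and `pmp_check` are hypothetical names, and each entry's byte range is assumed to have been decoded already from its A field and address registers.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    lo: int   # first byte matched
    hi: int   # one past the last byte matched; lo >= hi matches nothing
    r: bool
    w: bool
    x: bool
    l: bool


def pmp_check(entries, addr, size, access, mode):
    """Return True if the access succeeds.

    entries are listed lowest-numbered first; access is 'r', 'w' or 'x';
    mode is 'M', 'S' or 'U'.
    """
    end = addr + size
    for e in entries:
        if max(addr, e.lo) >= min(end, e.hi):
            continue                      # matches no byte of the access
        if not (e.lo <= addr and end <= e.hi):
            return False                  # partial match always fails
        if mode == "M" and not e.l:
            return True                   # unlocked entries do not bind M-mode
        return getattr(e, access)
    # No entry matched: M-mode succeeds; S/U succeed only if no PMP
    # entries are implemented at all.
    return mode == "M" or not entries
```

With a single entry covering 0xC–0xF, an 8-byte access at 0x8 fails in every mode, as in the example in the text.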

Failed accesses generate an instruction, load, or store access-fault
exception. Note that a single instruction may generate multiple
accesses, which may not be mutually atomic. An access-fault exception is
generated if at least one access generated by an instruction fails,
though other accesses generated by that instruction may succeed with
visible side effects. Notably, instructions that reference virtual
memory are decomposed into multiple accesses.
On some implementations, misaligned loads, stores, and instruction
fetches may also be decomposed into multiple accesses, some of which may
succeed before an access-fault exception occurs. In particular, a
portion of a misaligned store that passes the PMP check may become
visible, even if another portion fails the PMP check. The same behavior
may manifest for floating-point stores wider than XLEN bits (e.g., the
FSD instruction in RV32D), even when the store address is naturally
aligned.
Physical Memory Protection and Paging

The Physical Memory Protection mechanism is designed to compose with the
page-based virtual memory systems described in
Chapter [supervisor]. When paging is enabled,
instructions that access virtual memory may result in multiple
physical-memory accesses, including implicit references to the page
tables. The PMP checks apply to all of these accesses. The effective
privilege mode for implicit page-table accesses is S.
Implementations with virtual memory are permitted to perform address
translations speculatively and earlier than required by an explicit
memory access, and are permitted to cache them in address translation
cache structures—including possibly caching the identity mappings from
effective address to physical address used in Bare translation modes and
M-mode. The PMP settings for the resulting physical address may be
checked (and possibly cached) at any point between the address
translation and the explicit memory access. Hence, when the PMP settings
are modified, M-mode software must synchronize the PMP settings with the
virtual memory system and any PMP or address-translation caches. This is
accomplished by executing an SFENCE.VMA instruction with rs1=x0 and
rs2=x0, after the PMP CSRs are written.
If page-based virtual memory is not implemented, memory accesses check
the PMP settings synchronously, so no SFENCE.VMA is needed.
“Smrnmi” Standard Extension for Resumable Non-Maskable Interrupts, Version 0.4

Warning! This draft specification may change before being accepted as
standard by RISC-V International.
The base machine-level architecture supports only unresumable
non-maskable interrupts (UNMIs), where the NMI jumps to a handler in
machine mode, overwriting the current mepc and mcause register
values. If the hart had been executing machine-mode code in a trap
handler, the previous values in mepc and mcause would not be
recoverable and so execution is not generally resumable.
The Smrnmi extension adds support for resumable non-maskable interrupts
(RNMIs) to RISC-V. The extension adds four new CSRs (mnepc, mncause,
mnstatus, and mnscratch) to hold the interrupted state, and one new
instruction, MNRET, to resume from the RNMI handler.
RNMI Interrupt Signals

The rnmi interrupt signals are inputs to the hart. These interrupts
have higher priority than any other interrupt or exception on the hart
and cannot be disabled by software. Specifically, they are not disabled
by clearing the mstatus.MIE bit.
RNMI Handler Addresses

The RNMI interrupt trap handler address is implementation-defined.
RNMI also has an associated exception trap handler address, which is
implementation-defined.
RNMI CSRs

This proposal adds additional M-mode CSRs to enable a resumable
non-maskable interrupt (RNMI).


[Figure: mnscratch register format, a single MXLEN-bit field.]


The mnscratch CSR holds an MXLEN-bit read-write register which enables
the NMI trap handler to save and restore the context that was
interrupted.


[Figure: mnepc register format, a single MXLEN-bit field.]


The mnepc CSR is an MXLEN-bit read-write register which on entry to
the NMI trap handler holds the PC of the instruction that took the
interrupt.
The low bit of mnepc (mnepc[0]) is always zero. On implementations
that support only IALIGN=32, the two low bits (mnepc[1:0]) are always
zero.
If an implementation allows IALIGN to be either 16 or 32 (by changing
CSR misa, for example), then, whenever IALIGN=32, bit mnepc[1] is
masked on reads so that it appears to be 0. This masking occurs also for
the implicit read by the MRET instruction. Though masked, mnepc[1]
remains writable when IALIGN=32.
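
The read-side masking can be modeled in a few lines of Python. This is an informal sketch; `read_mnepc` is a hypothetical name, and the stored value stands for the underlying register contents.

```python
def read_mnepc(stored, ialign):
    """Return the value observed when reading mnepc.

    Bit 0 always reads as zero; bit 1 is additionally masked when IALIGN=32,
    although the stored bit itself is retained.
    """
    if ialign not in (16, 32):
        raise ValueError("IALIGN must be 16 or 32")
    mask = 0b01 if ialign == 16 else 0b11
    return stored & ~mask
```

Switching IALIGN from 32 back to 16 makes a previously written `mnepc[1]` visible again, since only the read is masked.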
mnepc is a WARL register that must be able to hold all valid virtual
addresses. It need not be capable of holding all possible invalid
addresses. Prior to writing mnepc, implementations may convert an
invalid address into some other invalid address that mnepc is capable
of holding.


[Figure: mncause register format. Bit MXLEN-1 is set to 1; the remaining MXLEN-1 bits hold the NMI cause.]


The mncause CSR holds the reason for the NMI, with bit MXLEN-1 set to
1, and the NMI cause encoded in the least-significant bits or zero if
NMI causes are not supported.


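| Bits | MXLEN-1:13 | 12:11 | 10:8 | 7 | 6:4 | 3 | 2:0 |
|:- |:- |:- |:- |:- |:- |:- |:- |
| Field | Reserved | MNPP | Reserved | MNPV | Reserved | NMIE | Reserved |
| Width | MXLEN-13 | 2 | 3 | 1 | 3 | 1 | 3 |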


The mnstatus CSR holds a two-bit field, MNPP, which on entry to the
trap handler holds the privilege mode of the interrupted context,
encoded in the same manner as mstatus.MPP. It also holds a one-bit
field, MNPV, which on entry to the trap handler holds the virtualization
mode of the interrupted context, encoded in the same manner as
mstatus.MPV.
mnstatus also holds the NMIE bit. When NMIE=1, nonmaskable interrupts
are enabled. When NMIE=0, all interrupts are disabled.
When NMIE=0, the hart behaves as though mstatus.MPRV were clear,
regardless of the current setting of mstatus.MPRV.
Upon reset, NMIE contains the value 0.

RNMIs are masked out of reset to give software the opportunity to
initialize data structures and devices for subsequent RNMI handling.

Software can set NMIE to 1, but attempts to clear NMIE have no effect.

Normally, only reset sequences will explicitly set the NMIE bit.


That the NMIE bit is settable does not suffice to support the nesting of
RNMIs. To support this feature in a direct manner would have required
allowing software to clear the NMIE bit—a design choice that would have
contravened the concept of non-maskability.
Software that wishes to minimize the latency until the next RNMI is
taken can follow the top-half/bottom-half model, where the RNMI handler
itself only enqueues a task to a task queue then returns. The bulk of
the interrupt servicing is performed later, with RNMIs enabled.

For the purposes of the WFI instruction, NMIE is a global interrupt
enable, meaning that the setting of NMIE does not affect the operation
of the WFI instruction.
The other bits in mnstatus are reserved; software should write zeros
and hardware implementations should return zeros.
MNRET Instruction

MNRET is an M-mode-only instruction that uses the values in mnepc and
mnstatus to return to the program counter, privilege mode, and
virtualization mode of the interrupted context. This instruction also
sets mnstatus.NMIE.

RNMI Operation

When an RNMI interrupt is detected, the interrupted PC is written to the
mnepc CSR, the type of RNMI to the mncause CSR, and the privilege
mode of the interrupted context to the mnstatus CSR. The
mnstatus.NMIE bit is cleared, masking all interrupts.
The hart then enters machine-mode and jumps to the RNMI trap handler
address.
The RNMI handler can resume original execution using the new MNRET
instruction, which restores the PC from mnepc, the privilege mode from
mnstatus, and also sets mnstatus.NMIE, which re-enables interrupts.
If the hart encounters an exception while the mnstatus.NMIE bit is
clear, the actions taken are the same as if the exception had occurred
while mnstatus.NMIE were set, except that the program counter is set
to the RNMI exception trap handler address (rather than the address
specified by mtvec).
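
The RNMI entry and MNRET sequence described above can be sketched as a small state model. This is an illustrative sketch only: the `Hart` class, its field names, and the `handler_addr` parameter are hypothetical, the virtualization-mode field (MNPV) is omitted for brevity, and `cause` is stored exactly as passed in.

```python
from dataclasses import dataclass

@dataclass
class Hart:
    pc: int = 0
    priv: int = 3      # 0 = U, 1 = S, 3 = M (same encoding as mstatus.MPP)
    mnepc: int = 0
    mncause: int = 0
    mnpp: int = 0      # models mnstatus.MNPP
    nmie: int = 0      # models mnstatus.NMIE, which is 0 out of reset

def take_rnmi(h, cause, handler_addr):
    """RNMI entry. Returns False when the RNMI is masked because NMIE=0."""
    if not h.nmie:
        return False
    h.mnepc, h.mncause, h.mnpp = h.pc, cause, h.priv
    h.nmie = 0                       # masks all interrupts
    h.priv, h.pc = 3, handler_addr   # enter M-mode at the RNMI handler
    return True

def mnret(h):
    """MNRET restores pc and privilege mode, then sets NMIE."""
    h.pc, h.priv = h.mnepc, h.mnpp
    h.nmie = 1

def write_nmie(h, value):
    """Software can set NMIE; attempts to clear it have no effect."""
    if value:
        h.nmie = 1
```

In this model an RNMI arriving before reset code has set NMIE is simply not taken, which mirrors the statement that RNMIs are masked out of reset.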

The Smrnmi extension does not change the behavior of the MRET and SRET
instructions. In particular, MRET and SRET are unaffected by the
mnstatus.NMIE bit, and their execution does not alter the
mnstatus.NMIE bit.

# Supervisor-Level ISA, Version 1.12
This chapter describes the RISC-V supervisor-level architecture, which
contains a common core that is used with various supervisor-level
address translation and protection schemes.

Supervisor mode is deliberately restricted in terms of interactions with
underlying physical hardware, such as physical memory and device
interrupts, to support clean virtualization. In this spirit, certain
supervisor-level facilities, including requests for timer and
interprocessor interrupts, are provided by implementation-specific
mechanisms. In some systems, a supervisor execution environment (SEE)
provides these facilities in a manner specified by a supervisor binary
interface (SBI). Other systems supply these facilities directly, through
some other implementation-defined mechanism.

Supervisor CSRs

A number of CSRs are provided for the supervisor.

The supervisor should only view CSR state that should be visible to a
supervisor-level operating system. In particular, there is no
information about the existence (or non-existence) of higher privilege
levels (machine level or other) visible in the CSRs accessible by the
supervisor.
Many supervisor CSRs are a subset of the equivalent machine-mode CSR,
and the machine-mode chapter should be read first to help understand the
supervisor-level CSR descriptions.

Supervisor Status Register (sstatus)

The sstatus register is an SXLEN-bit read/write register formatted as
shown in Figure [sstatusreg-rv32] when SXLEN=32 and
Figure [sstatusreg] when SXLEN=64. The
sstatus register keeps track of the processor’s current operating
state.


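sstatus when SXLEN=32 (Figure [sstatusreg-rv32]):

| Field | SD | WPRI | MXR | SUM | WPRI | XS[1:0] | FS[1:0] | WPRI | VS[1:0] | SPP | WPRI | UBE | SPIE | WPRI | SIE | WPRI |
|:--|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| Bits | 31 | 30:20 | 19 | 18 | 17 | 16:15 | 14:13 | 12:11 | 10:9 | 8 | 7 | 6 | 5 | 4:2 | 1 | 0 |
| Width | 1 | 11 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 3 | 1 | 1 |

sstatus when SXLEN=64 (Figure [sstatusreg]):

| Field | SD | WPRI | UXL[1:0] | WPRI | MXR | SUM | WPRI | XS[1:0] | FS[1:0] | WPRI | VS[1:0] | SPP | WPRI | UBE | SPIE | WPRI | SIE | WPRI |
|:--|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| Bits | 63 | 62:34 | 33:32 | 31:20 | 19 | 18 | 17 | 16:15 | 14:13 | 12:11 | 10:9 | 8 | 7 | 6 | 5 | 4:2 | 1 | 0 |
| Width | 1 | 29 | 2 | 12 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 3 | 1 | 1 |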

The SPP bit indicates the privilege level at which a hart was executing
before entering supervisor mode. When a trap is taken, SPP is set to 0
if the trap originated from user mode, or 1 otherwise. When an SRET
instruction (see Section [otherpriv]) is executed to return from
the trap handler, the privilege level is set to user mode if the SPP bit
is 0, or supervisor mode if the SPP bit is 1; SPP is then set to 0.
The SIE bit enables or disables all interrupts in supervisor mode. When
SIE is clear, interrupts are not taken while in supervisor mode. When
the hart is running in user-mode, the value in SIE is ignored, and
supervisor-level interrupts are enabled. The supervisor can disable
individual interrupt sources using the sie CSR.
The SPIE bit indicates whether supervisor interrupts were enabled prior
to trapping into supervisor mode. When a trap is taken into supervisor
mode, SPIE is set to SIE, and SIE is set to 0. When an SRET instruction
is executed, SIE is set to SPIE, then SPIE is set to 1.
The sstatus register is a subset of the mstatus register.
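
The SPP, SIE, and SPIE updates on trap entry and SRET can be written out as bit operations. The sketch below is illustrative rather than normative: the function names are hypothetical, privilege modes are passed as 0 (U) and 1 (S), and the bit positions used are SIE=1, SPIE=5, SPP=8.

```python
SIE, SPIE, SPP = 1 << 1, 1 << 5, 1 << 8   # sstatus bit masks

def trap_to_s(sstatus, from_priv):
    """Trap into S-mode: SPIE <- SIE, SIE <- 0, SPP <- previous privilege."""
    spie = SPIE if sstatus & SIE else 0
    spp = SPP if from_priv == 1 else 0
    return (sstatus & ~(SIE | SPIE | SPP)) | spie | spp

def sret(sstatus):
    """SRET: SIE <- SPIE, SPIE <- 1, SPP <- 0. Returns (new_sstatus, new_priv)."""
    new_priv = 1 if sstatus & SPP else 0
    sie = SIE if sstatus & SPIE else 0
    return (sstatus & ~(SIE | SPP)) | sie | SPIE, new_priv
```

A trap followed by SRET therefore restores the original SIE value and privilege mode, leaving SPIE set and SPP clear.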

In a straightforward implementation, reading or writing any field in
sstatus is equivalent to reading or writing the homonymous field in
mstatus.

Base ISA Control in sstatus Register

The UXL field controls the value of XLEN for U-mode, termed UXLEN,
which may differ from the value of XLEN for S-mode, termed SXLEN. The
encoding of UXL is the same as that of the MXL field of misa, shown in
Table [misabase].
When SXLEN=32, the UXL field does not exist, and UXLEN=32. When
SXLEN=64, it is a WARL field that encodes the current value of UXLEN. In
particular, an implementation may make UXL be a read-only field whose
value always ensures that UXLEN=SXLEN.
If UXLEN ≠ SXLEN, instructions executed in the narrower mode must ignore
source register operand bits above the configured XLEN, and must
sign-extend results to fill the widest supported XLEN in the destination
register.
If UXLEN < SXLEN, user-mode instruction-fetch addresses and load and
store effective addresses are taken modulo 2^UXLEN. For
example, when UXLEN=32 and SXLEN=64, user-mode memory accesses reference
the lowest 4 GiB of the address space.
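
As a worked illustration of the two width rules above, the following sketch truncates a user-mode address modulo 2^UXLEN and sign-extends a narrow result into a wider register. The function names are hypothetical and the code models only the arithmetic, not any particular implementation.

```python
def user_effective_address(addr, uxlen):
    """User-mode addresses are taken modulo 2**UXLEN when UXLEN < SXLEN."""
    return addr % (1 << uxlen)

def sign_extend(value, xlen, widest):
    """Ignore bits above xlen, then sign-extend the result to `widest` bits."""
    value &= (1 << xlen) - 1
    if value >> (xlen - 1):
        value |= ((1 << widest) - 1) & ~((1 << xlen) - 1)
    return value
```

With UXLEN=32 and SXLEN=64, the address 0x1_0000_2000 wraps to 0x2000, and a 32-bit result of 0x8000_0000 is held as 0xFFFF_FFFF_8000_0000.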

Memory Privilege in sstatus Register

The MXR (Make eXecutable Readable) bit modifies the privilege with which
loads access virtual memory. When MXR=0, only loads from pages marked
readable (R=1 in Figure [sv32pte]) will succeed. When MXR=1, loads
from pages marked either readable or executable (R=1 or X=1) will
succeed. MXR has no effect when page-based virtual memory is not in
effect.
The SUM (permit Supervisor User Memory access) bit modifies the
privilege with which S-mode loads and stores access virtual memory. When
SUM=0, S-mode memory accesses to pages that are accessible by U-mode
(U=1 in Figure [sv32pte]) will fault. When SUM=1, these
accesses are permitted. SUM has no effect when page-based virtual memory
is not in effect, nor when executing in U-mode. Note that S-mode can
never execute instructions from user pages, regardless of the state of
SUM.
SUM is read-only 0 if satp.MODE is read-only 0.
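
The combined effect of MXR and SUM on a load can be sketched as a predicate over the page's R, X, and U bits. This is a simplified sketch: the function name and parameters are hypothetical, it assumes page-based virtual memory is in effect, it assumes (from the paged virtual-memory rules elsewhere in this specification) that U-mode cannot access pages with U=0, and it ignores PMP, MPRV, and the A/D bits.

```python
def load_permitted(priv, r, x, u, mxr=0, sum_=0):
    """priv is 'U' or 'S'; r, x, u are the PTE permission bits of the page."""
    if priv == "U" and not u:
        return False                  # assumed: U-mode needs U=1
    if priv == "S" and u and not sum_:
        return False                  # SUM=0: S-mode access to user pages faults
    return bool(r or (mxr and x))     # MXR=1 makes executable pages readable
```

For example, an S-mode load from a readable user page is refused with SUM=0 and allowed with SUM=1, while a load from an execute-only page succeeds only when MXR=1.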

The SUM mechanism prevents supervisor software from inadvertently
accessing user memory. Operating systems can execute the majority of
code with SUM clear; the few code segments that should access user
memory can temporarily set SUM.
The SUM mechanism does not avail S-mode software of permission to
execute instructions in user code pages. Legitimate use cases for
execution from user memory in supervisor context are rare in general and
nonexistent in POSIX environments. However, bugs in supervisors that
lead to arbitrary code execution are much easier to exploit if the
supervisor exploit code can be stored in a user buffer at a virtual
address chosen by an attacker.
Some non-POSIX single address space operating systems do allow certain
privileged software to partially execute in supervisor mode, while most
programs run in user mode, all in a shared address space. This use case
can be realized by mapping the physical code pages at multiple virtual
addresses with different permissions, possibly with the assistance of
the instruction page-fault handler to direct supervisor software to use
the alternate mapping.

Endianness Control in sstatus Register

The UBE bit is a WARL field that controls the endianness of explicit memory
accesses made from U-mode, which may differ from the endianness of
memory accesses in S-mode. An implementation may make UBE be a read-only
field that always specifies the same endianness as for S-mode.
UBE controls whether explicit load and store memory accesses made from
U-mode are little-endian (UBE=0) or big-endian (UBE=1).
UBE has no effect on instruction fetches, which are implicit memory
accesses that are always little-endian.
For implicit accesses to supervisor-level memory management data
structures, such as page tables, S-mode endianness always applies and
UBE is ignored.

Standard RISC-V ABIs are expected to be purely little-endian-only or
big-endian-only, with no accommodation for mixing endianness.
Nevertheless, endianness control has been defined so as to permit an OS
of one endianness to execute user-mode programs of the opposite
endianness.

Supervisor Trap Vector Base Address Register (stvec)

The stvec register is an SXLEN-bit read/write register that holds trap
vector configuration, consisting of a vector base address (BASE) and a
vector mode (MODE).


| Field | BASE[SXLEN-1:2] (WARL) | MODE (WARL) |
|:--|:-:|:-:|
| Bits | SXLEN-1:2 | 1:0 |
| Width | SXLEN-2 | 2 |


The BASE field in stvec is a WARL field that can hold any valid virtual or
physical address, subject to the following alignment constraints: the
address must be 4-byte aligned, and MODE settings other than Direct
might impose additional alignment constraints on the value in the BASE
field.


| Value | Name | Description |
|:-:|:-:|:--|
| 0 | Direct | All exceptions set pc to BASE. |
| 1 | Vectored | Asynchronous interrupts set pc to BASE+4×cause. |
| ≥2 | - | Reserved |


The encoding of the MODE field is shown in
Table [stvec-mode]. When MODE=Direct, all
traps into supervisor mode cause the pc to be set to the address in
the BASE field. When MODE=Vectored, all synchronous exceptions into
supervisor mode cause the pc to be set to the address in the BASE
field, whereas interrupts cause the pc to be set to the address in the
BASE field plus four times the interrupt cause number. For example, a
supervisor-mode timer interrupt (see
Table [scauses]) causes the pc to be set to
BASE+0x14. Setting MODE=Vectored may impose a stricter alignment
constraint on BASE.
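
The Direct and Vectored rules can be checked with a short calculation. This is a minimal sketch with a hypothetical function name; it treats the low two bits of `stvec` as MODE and the remaining bits as the 4-byte-aligned base address.

```python
def trap_target(stvec, cause, is_interrupt):
    """Return the pc selected by stvec for a trap into S-mode."""
    base = stvec & ~0b11      # BASE holds address bits SXLEN-1:2
    mode = stvec & 0b11
    if mode == 0:             # Direct: every trap goes to BASE
        return base
    if mode == 1:             # Vectored: only interrupts are offset
        return base + 4 * cause if is_interrupt else base
    raise ValueError("reserved stvec MODE")
```

With MODE=Vectored and a base of 0x80000100, a supervisor timer interrupt (cause 5) lands at 0x80000114, matching the BASE+0x14 example above, while a synchronous exception still goes to 0x80000100.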

Supervisor Interrupt Registers (sip and sie)

The sip register is an SXLEN-bit read/write register containing
information on pending interrupts, while sie is the corresponding
SXLEN-bit read/write register containing interrupt enable bits.
Interrupt cause number i (as reported in CSR scause,
Section 1.1.8) corresponds with bit i in both
sip and sie. Bits 15:0 are allocated to standard interrupt causes
only, while bits 16 and above are designated for platform or custom use.


| Field | sip |
|:--|:-:|
| Width | SXLEN |

| Field | sie |
|:--|:-:|
| Width | SXLEN |


An interrupt i will trap to S-mode if both of the following are true:
(a) either the current privilege mode is S and the SIE bit in the
sstatus register is set, or the current privilege mode has less
privilege than S-mode; and (b) bit i is set in both sip and sie.
These conditions for an interrupt trap to occur must be evaluated in a
bounded amount of time from when an interrupt becomes, or ceases to be,
pending in sip, and must also be evaluated immediately following the
execution of an SRET instruction or an explicit write to a CSR on which
these interrupt trap conditions expressly depend (including sip, sie
and sstatus).
Interrupts to S-mode take priority over any interrupts to lower
privilege modes.
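
Conditions (a) and (b) above translate directly into a predicate. The sketch below uses a hypothetical function name and encodes privilege modes as 0 (U), 1 (S), and 3 (M); it does not model delegation or the timing requirements.

```python
def traps_to_s_mode(i, priv, sstatus_sie, sip, sie):
    """True if interrupt i will trap to S-mode under conditions (a) and (b)."""
    globally_enabled = (priv == 1 and sstatus_sie) or priv < 1
    pending_and_enabled = (sip >> i) & 1 and (sie >> i) & 1
    return bool(globally_enabled and pending_and_enabled)
```

Note that in U-mode the SIE bit of sstatus plays no part, whereas in M-mode the interrupt never traps to S-mode.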
Each individual bit in register sip may be writable or may be
read-only. When bit i in sip is writable, a pending interrupt i
can be cleared by writing 0 to this bit. If interrupt i can become
pending but bit i in sip is read-only, the implementation must
provide some other mechanism for clearing the pending interrupt (which
may involve a call to the execution environment).
A bit in sie must be writable if the corresponding interrupt can ever
become pending. Bits of sie that are not writable are read-only zero.
The standard portions (bits 15:0) of registers sip and sie are
formatted as shown in Figures
[sipreg-standard] and
[siereg-standard] respectively.


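Standard portion (bits 15:0) of sip (Figure [sipreg-standard]):

| Field | 0 | SEIP | 0 | STIP | 0 | SSIP | 0 |
|:--|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| Bits | 15:10 | 9 | 8:6 | 5 | 4:2 | 1 | 0 |
| Width | 6 | 1 | 3 | 1 | 3 | 1 | 1 |

Standard portion (bits 15:0) of sie (Figure [siereg-standard]):

| Field | 0 | SEIE | 0 | STIE | 0 | SSIE | 0 |
|:--|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| Bits | 15:10 | 9 | 8:6 | 5 | 4:2 | 1 | 0 |
| Width | 6 | 1 | 3 | 1 | 3 | 1 | 1 |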


Bits sip.SEIP and sie.SEIE are the interrupt-pending and
interrupt-enable bits for supervisor-level external interrupts. If
implemented, SEIP is read-only in sip, and is set and cleared by the
execution environment, typically through a platform-specific interrupt
controller.
Bits sip.STIP and sie.STIE are the interrupt-pending and
interrupt-enable bits for supervisor-level timer interrupts. If
implemented, STIP is read-only in sip, and is set and cleared by the
execution environment.
Bits sip.SSIP and sie.SSIE are the interrupt-pending and
interrupt-enable bits for supervisor-level software interrupts. If
implemented, SSIP is writable in sip and may also be set to 1 by a
platform-specific interrupt controller.

Interprocessor interrupts are sent to other harts by
implementation-specific means, which will ultimately cause the SSIP bit
to be set in the recipient hart’s sip register.

Each standard interrupt type (SEI, STI, or SSI) may not be implemented,
in which case the corresponding interrupt-pending and interrupt-enable
bits are read-only zeros. All bits in sip and sie are WARL fields. The
implemented interrupts may be found by writing one to every bit location
in sie, then reading back to see which bit positions hold a one.

The sip and sie registers are subsets of the mip and mie
registers. Reading any implemented field, or writing any writable field,
of sip/sie effects a read or write of the homonymous field of
mip/mie.
Bits 3, 7, and 11 of sip and sie correspond to the machine-mode
software, timer, and external interrupts, respectively. Since most
platforms will choose not to make these interrupts delegatable from
M-mode to S-mode, they are shown as 0 in
Figures [sipreg-standard] and
[siereg-standard].

Multiple simultaneous interrupts destined for supervisor mode are
handled in the following decreasing priority order: SEI, SSI, STI.
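
The priority order can be expressed as a simple scan over the pending-and-enabled bits. This sketch covers only the three standard supervisor interrupts, using their cause numbers (SSI=1, STI=5, SEI=9); the function name is hypothetical, and platform or custom interrupts at bit 16 and above are not modeled.

```python
SSI, STI, SEI = 1, 5, 9
PRIORITY = (SEI, SSI, STI)    # decreasing priority

def next_supervisor_interrupt(sip, sie):
    """Return the cause number to service first, or None if nothing is ready."""
    ready = sip & sie
    for cause in PRIORITY:
        if (ready >> cause) & 1:
            return cause
    return None
```

When a timer and a software interrupt are pending together, the software interrupt (cause 1) is selected first.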

Supervisor Timers and Performance Counters

Supervisor software uses the same hardware performance monitoring
facility as user-mode software, including the time, cycle, and
instret CSRs. The implementation should provide a mechanism to modify
the counter values.
The implementation must provide a facility for scheduling timer
interrupts in terms of the real-time counter, time.
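
Counter-Enable Register (scounteren)

| Field | HPM31 | HPM30 | HPM29 | ... | HPM5 | HPM4 | HPM3 | IR | TM | CY |
|:--|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| Bits | 31 | 30 | 29 | 28:6 | 5 | 4 | 3 | 2 | 1 | 0 |
| Width | 1 | 1 | 1 | 23 | 1 | 1 | 1 | 1 | 1 | 1 |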


The counter-enable register scounteren is a 32-bit register that
controls the availability of the hardware performance monitoring
counters to U-mode.
When the CY, TM, IR, or HPMn bit in the scounteren register is
clear, attempts to read the cycle, time, instret, or hpmcountern
register while executing in U-mode will cause an illegal instruction
exception. When one of these bits is set, access to the corresponding
register is permitted.
scounteren must be implemented. However, any of the bits may be
read-only zero, indicating reads to the corresponding counter will cause
an exception when executing in U-mode. Hence, they are effectively
WARL fields.

The setting of a bit in mcounteren does not affect whether the
corresponding bit in scounteren is writable. However, U-mode may only
access a counter if the corresponding bits in scounteren and
mcounteren are both set.
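
The two-level gating of U-mode counter access amounts to an AND of the matching bits. A minimal sketch follows, with a hypothetical function name and with bit positions CY=0, TM=1, IR=2, and HPMn=n.

```python
CY, TM, IR = 0, 1, 2   # HPMn uses bit n, for n = 3..31

def u_mode_counter_readable(bit, mcounteren, scounteren):
    """U-mode may read a counter only if both enable bits are set."""
    return bool((mcounteren >> bit) & 1 and (scounteren >> bit) & 1)
```

If either register has the bit clear, the U-mode read raises an illegal instruction exception instead.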

Supervisor Scratch Register (sscratch)

The sscratch register is an SXLEN-bit read/write register, dedicated
for use by the supervisor. Typically, sscratch is used to hold a
pointer to the hart-local supervisor context while the hart is executing
user code. At the beginning of a trap handler, sscratch is swapped
with a user register to provide an initial working register.


| Field | sscratch |
|:--|:-:|
| Width | SXLEN |


Supervisor Exception Program Counter (sepc)

sepc is an SXLEN-bit read/write register formatted as shown in
Figure [epcreg]. The low bit of sepc (sepc[0])
is always zero. On implementations that support only IALIGN=32, the two
low bits (sepc[1:0]) are always zero.
If an implementation allows IALIGN to be either 16 or 32 (by changing
CSR misa, for example), then, whenever IALIGN=32, bit sepc[1] is
masked on reads so that it appears to be 0. This masking occurs also for
the implicit read by the SRET instruction. Though masked, sepc[1]
remains writable when IALIGN=32.
sepc is a WARL register that must be able to hold all valid virtual
addresses. It need not be capable of holding all possible invalid
addresses. Prior to writing sepc, implementations may convert an
invalid address into some other invalid address that sepc is capable
of holding.
When a trap is taken into S-mode, sepc is written with the virtual
address of the instruction that was interrupted or that encountered the
exception. Otherwise, sepc is never written by the implementation,
though it may be explicitly written by software.
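
The IALIGN-dependent read masking of sepc described above can be sketched as follows. The function name is hypothetical, and `stored` stands for the underlying register contents, which keep bit 1 writable even when IALIGN=32.

```python
def read_sepc(stored, ialign):
    """Value observed when reading sepc (also used implicitly by SRET)."""
    mask = ~0b11 if ialign == 32 else ~0b1   # sepc[0] always reads as zero
    return stored & mask
```

A stored value of 0x1002 reads back as 0x1000 while IALIGN=32 and as 0x1002 once IALIGN=16 is selected again.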


| Field | sepc |
|:--|:-:|
| Width | SXLEN |


Supervisor Cause Register (scause)

The scause register is an SXLEN-bit read-write register formatted as
shown in Figure [scausereg]. When a trap is taken into
S-mode, scause is written with a code indicating the event that
caused the trap. Otherwise, scause is never written by the
implementation, though it may be explicitly written by software.
The Interrupt bit in the scause register is set if the trap was caused
by an interrupt. The Exception Code field contains a code identifying
the last exception or interrupt.
Table [scauses] lists the possible exception
codes for the current supervisor ISAs. The Exception Code is a WLRL field.
It is required to hold the values 0–31 (i.e., bits 4–0 must be
implemented), but otherwise it is only guaranteed to hold supported
exception codes.


| Field | Interrupt | Exception Code (WLRL) |
|:--|:-:|:-:|
| Bits | SXLEN-1 | SXLEN-2:0 |
| Width | 1 | SXLEN-1 |


| Interrupt | Exception Code | Description |
|:-:|:-:|:--|
| 1 | 0 | Reserved |
| 1 | 1 | Supervisor software interrupt |
| 1 | 2–4 | Reserved |
| 1 | 5 | Supervisor timer interrupt |
| 1 | 6–8 | Reserved |
| 1 | 9 | Supervisor external interrupt |
| 1 | 10–15 | Reserved |
| 1 | ≥16 | Designated for platform use |
| 0 | 0 | Instruction address misaligned |
| 0 | 1 | Instruction access fault |
| 0 | 2 | Illegal instruction |
| 0 | 3 | Breakpoint |
| 0 | 4 | Load address misaligned |
| 0 | 5 | Load access fault |
| 0 | 6 | Store/AMO address misaligned |
| 0 | 7 | Store/AMO access fault |
| 0 | 8 | Environment call from U-mode |
| 0 | 9 | Environment call from S-mode |
| 0 | 10–11 | Reserved |
| 0 | 12 | Instruction page fault |
| 0 | 13 | Load page fault |
| 0 | 14 | Reserved |
| 0 | 15 | Store/AMO page fault |
| 0 | 16–23 | Reserved |
| 0 | 24–31 | Designated for custom use |
| 0 | 32–47 | Reserved |
| 0 | 48–63 | Designated for custom use |
| 0 | ≥64 | Reserved |
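
A trap handler typically begins by splitting scause into its Interrupt bit and Exception Code. The sketch below shows that split; the function name and the small lookup table are illustrative, and only the three standard supervisor interrupts are named.

```python
STANDARD_INTERRUPTS = {
    1: "Supervisor software interrupt",
    5: "Supervisor timer interrupt",
    9: "Supervisor external interrupt",
}

def decode_scause(scause, sxlen=64):
    """Return (is_interrupt, exception_code) for an SXLEN-bit scause value."""
    is_interrupt = bool(scause >> (sxlen - 1))        # top bit
    code = scause & ((1 << (sxlen - 1)) - 1)          # remaining SXLEN-1 bits
    return is_interrupt, code
```

For SXLEN=32, the value 0x80000009 decodes to an interrupt with code 9, which `STANDARD_INTERRUPTS` names as the supervisor external interrupt.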


Supervisor Trap Value (stval) Register

The stval register is an SXLEN-bit read-write register formatted as
shown in Figure [stvalreg]. When a trap is taken into
S-mode, stval is written with exception-specific information to assist
software in handling the trap. Otherwise, stval is never written by
the implementation, though it may be explicitly written by software. The
hardware platform will specify which exceptions must set stval
informatively and which may unconditionally set it to zero.
If stval is written with a nonzero value when a breakpoint,
address-misaligned, access-fault, or page-fault exception occurs on an
instruction fetch, load, or store, then stval will contain the
faulting virtual address.


| Field | stval |
|:--|:-:|
| Width | SXLEN |


If stval is written with a nonzero value when a misaligned load or
store causes an access-fault or page-fault exception, then stval will
contain the virtual address of the portion of the access that caused the
fault.
If stval is written with a nonzero value when an instruction
access-fault or page-fault exception occurs on a system with
variable-length instructions, then stval will contain the virtual
address of the portion of the instruction that caused the fault, while
sepc will point to the beginning of the instruction.
The stval register can optionally also be used to return the faulting
instruction bits on an illegal instruction exception (sepc points to
the faulting instruction in memory). If stval is written with a
nonzero value when an illegal-instruction exception occurs, then stval
will contain the shortest of:

the actual faulting instruction
the first ILEN bits of the faulting instruction
the first SXLEN bits of the faulting instruction

The value loaded into stval on an illegal-instruction exception is
right-justified and all unused upper bits are cleared to zero.
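
The "shortest of" rule for illegal-instruction traps can be sketched as a mask over the low bits of the instruction. The function name and parameters are hypothetical, and the sketch assumes `instr_bits` holds the instruction with its first (lowest-addressed) bits in the least-significant positions.

```python
def stval_for_illegal_instruction(instr_bits, instr_len, ilen=32, sxlen=64):
    """Right-justified faulting instruction bits, upper bits cleared to zero."""
    n = min(instr_len, ilen, sxlen)     # shortest of the three candidates
    return instr_bits & ((1 << n) - 1)
```

For a 16-bit instruction, only the low 16 bits survive, so any following bytes that happened to be fetched alongside it are not reported.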
For other traps, stval is set to zero, but a future standard may
redefine stval’s setting for other traps.
stval is a WARL register that must be able to hold all valid virtual
addresses and the value 0. It need not be capable of holding all
possible invalid addresses. Prior to writing stval, implementations
may convert an invalid address into some other invalid address that
stval is capable of holding. If the feature to return the faulting
instruction bits is implemented, stval must also be able to hold all
values less than 2^N, where N is the smaller of SXLEN and
ILEN.

Supervisor Environment Configuration Register (senvcfg)

The senvcfg CSR is an SXLEN-bit read/write register, formatted as
shown in Figure [fig:senvcfg], that controls certain
characteristics of the U-mode execution environment.


| Field | WPRI | CBZE | CBCFE | CBIE | WPRI | FIOM |
|:--|:-:|:-:|:-:|:-:|:-:|:-:|
| Bits | SXLEN-1:8 | 7 | 6 | 5:4 | 3:1 | 0 |
| Width | SXLEN-8 | 1 | 1 | 2 | 3 | 1 |


If bit FIOM (Fence of I/O implies Memory) is set to one in senvcfg,
FENCE instructions executed in U-mode are modified so the requirement to
order accesses to device I/O implies also the requirement to order main
memory accesses.
Table 1.1 details the modified
interpretation of FENCE instruction bits PI, PO, SI, and SO in U-mode
when FIOM=1.
Similarly, for U-mode when FIOM=1, if an atomic instruction that
accesses a region ordered as device I/O has its aq and/or rl bit
set, then that instruction is ordered as though it accesses both device
I/O and memory.
If satp.MODE is read-only zero (always Bare), the implementation may
make FIOM read-only zero.


| Instruction bit | Meaning when set |
|:-:|:--|
| PI | Predecessor device input and memory reads (PR implied) |
| PO | Predecessor device output and memory writes (PW implied) |
| SI | Successor device input and memory reads (SR implied) |
| SO | Successor device output and memory writes (SW implied) |

Modified interpretation of FENCE predecessor and successor sets in
U-mode when FIOM=1.
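
The modified interpretation can be modeled as widening the set of FENCE bits. In this sketch the bits are represented by their names rather than by instruction-encoding positions, and the function name is hypothetical.

```python
IMPLIED = {"PI": "PR", "PO": "PW", "SI": "SR", "SO": "SW"}

def effective_fence_bits(bits, fiom, priv="U"):
    """Return the fence bits as interpreted by the hart."""
    bits = set(bits)
    if fiom and priv == "U":
        # Each I/O ordering bit also implies the matching memory ordering bit.
        bits |= {IMPLIED[b] for b in bits if b in IMPLIED}
    return bits
```

A U-mode `fence i, o` with FIOM=1 thus behaves like `fence ir, ow`, while the same fence executed in S-mode is unchanged.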


Bit FIOM exists for a specific circumstance when an I/O device is being
emulated for U-mode and both of the following are true: (a) the emulated
device has a memory buffer that should be I/O space but is actually
mapped to main memory via address translation, and (b) multiple physical
harts are involved in accessing this emulated device from U-mode.
A hypervisor running in S-mode without the benefit of the hypervisor
extension of Chapter [hypervisor] may need to emulate a
device for U-mode if paravirtualization cannot be employed. If the same
hypervisor provides a virtual machine (VM) with multiple virtual harts,
mapped one-to-one to real harts, then multiple harts may concurrently
access the emulated device, perhaps because: (a) the guest OS within the
VM assigns device interrupt handling to one hart while the device is
also accessed by a different hart outside of an interrupt handler, or
(b) control of the device (or partial control) is being migrated from
one hart to another, such as for interrupt load balancing within the VM.
For such cases, guest software within the VM is expected to properly
coordinate access to the (emulated) device across multiple harts using
mutex locks and/or interprocessor interrupts as usual, which in part
entails executing I/O fences. But those I/O fences may not be sufficient
if some of the device “I/O” is actually main memory, unknown to the
guest. Setting FIOM=1 modifies those fences (and all other I/O fences
executed in U-mode) to include main memory, too.
Software can always avoid the need to set FIOM by never using main
memory to emulate a device memory buffer that should be I/O space.
However, this choice usually requires trapping all U-mode accesses to
the emulated buffer, which might have a noticeable impact on
performance. The alternative offered by FIOM is sufficiently inexpensive
to implement that we consider it worth supporting even if only rarely
enabled.

The definition of the CBZE field will be furnished by the forthcoming
Zicboz extension. Its allocation within senvcfg may change prior to
the ratification of that extension.
The definitions of the CBCFE and CBIE fields will be furnished by the
forthcoming Zicbom extension. Their allocations within senvcfg may
change prior to the ratification of that extension.

Supervisor Address Translation and Protection (satp) Register

The satp register is an SXLEN-bit read/write register, formatted as
shown in Figure [rv32satp] for SXLEN=32 and
Figure [rv64satp] for SXLEN=64, which controls
supervisor-mode address translation and protection. This register holds
the physical page number (PPN) of the root page table, i.e., its
supervisor physical address divided by 4 KiB; an address space identifier
(ASID), which facilitates address-translation fences on a
per-address-space basis; and the MODE field, which selects the current
address-translation scheme. Further details on the access to this
register are described in
Section [virt-control].


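satp when SXLEN=32 (Figure [rv32satp]):

| Field | MODE (WARL) | ASID (WARL) | PPN (WARL) |
|:--|:-:|:-:|:-:|
| Bits | 31 | 30:22 | 21:0 |
| Width | 1 | 9 | 22 |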


Storing a PPN in satp, rather than a physical address, supports a
physical address space larger than 4 GiB for RV32.
The satp.PPN field might not be capable of holding all physical page
numbers. Some platform standards might place constraints on the values
satp.PPN may assume, e.g., by requiring that all physical page numbers
corresponding to main memory be representable.


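satp when SXLEN=64 (Figure [rv64satp]):

| Field | MODE (WARL) | ASID (WARL) | PPN (WARL) |
|:--|:-:|:-:|:-:|
| Bits | 63:60 | 59:44 | 43:0 |
| Width | 4 | 16 | 44 |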


We store the ASID and the page table base address in the same CSR to
allow the pair to be changed atomically on a context switch. Swapping
them non-atomically could pollute the old virtual address space with new
translations, or vice-versa. This approach also slightly reduces the
cost of a context switch.

Table 1.2 shows the encodings of the MODE
field when SXLEN=32 and SXLEN=64. When MODE=Bare, supervisor virtual
addresses are equal to supervisor physical addresses, and there is no
additional memory protection beyond the physical memory protection
scheme described in Section [sec:pmp]. To select MODE=Bare, software
must write zero to the remaining fields of satp (bits 30–0 when
SXLEN=32, or bits 59–0 when SXLEN=64). Attempting to select MODE=Bare
with a nonzero pattern in the remaining fields has an UNSPECIFIED effect on the
value that the remaining fields assume and an UNSPECIFIED effect on address
translation and protection behavior.
When SXLEN=32, the satp encodings corresponding to MODE=Bare and
ASID[8:7]=3 are designated for custom use, whereas the encodings
corresponding to MODE=Bare and ASID[8:7]≠3 are reserved for
future standard use. When SXLEN=64, all satp encodings corresponding
to MODE=Bare are reserved for future standard use.

Version 1.11 of this standard stated that the remaining fields in satp
had no effect when MODE=Bare. Making these fields reserved facilitates
future definition of additional translation and protection modes,
particularly in RV32, for which all patterns of the existing MODE field
have already been allocated.

When SXLEN=32, the only other valid setting for MODE is Sv32, a paged
virtual-memory scheme described in
Section 1.3.
When SXLEN=64, three paged virtual-memory schemes are defined: Sv39,
Sv48, and Sv57, described in
Sections 1.4,
1.5, and
1.6, respectively. One additional scheme,
Sv64, will be defined in a later version of this specification. The
remaining MODE settings are reserved for future use and may define
different interpretations of the other fields in satp.
Implementations are not required to support all MODE settings, and if
satp is written with an unsupported MODE, the entire write has no
effect; no fields in satp are modified.


SXLEN=32

| Value | Name | Description |
|:-:|:-:|:--|
| 0 | Bare | No translation or protection. |
| 1 | Sv32 | Page-based 32-bit virtual addressing (see Section 1.3). |

SXLEN=64

| Value | Name | Description |
|:-:|:-:|:--|
| 0 | Bare | No translation or protection. |
| 1–7 | - | Reserved for standard use |
| 8 | Sv39 | Page-based 39-bit virtual addressing (see Section 1.4). |
| 9 | Sv48 | Page-based 48-bit virtual addressing (see Section 1.5). |
| 10 | Sv57 | Page-based 57-bit virtual addressing (see Section 1.6). |
| 11 | Sv64 | Reserved for page-based 64-bit virtual addressing. |
| 12–13 | - | Reserved for standard use |
| 14–15 | - | Designated for custom use |

Encoding of satp MODE field.
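
Packing and unpacking satp follows from the field widths (MODE, ASID, PPN are 1, 9, 22 bits for SXLEN=32 and 4, 16, 44 bits for SXLEN=64). The sketch below is illustrative: the helper names are hypothetical, and `write_satp` models only the rule that a write with an unsupported MODE leaves the register unchanged.

```python
FIELDS = {32: (1, 9, 22), 64: (4, 16, 44)}   # SXLEN: (MODE, ASID, PPN) widths

def pack_satp(mode, asid, ppn, sxlen=64):
    mode_w, asid_w, ppn_w = FIELDS[sxlen]
    if mode >> mode_w or asid >> asid_w or ppn >> ppn_w:
        raise ValueError("field value does not fit")
    return (mode << (asid_w + ppn_w)) | (asid << ppn_w) | ppn

def unpack_satp(satp, sxlen=64):
    mode_w, asid_w, ppn_w = FIELDS[sxlen]
    return (satp >> (asid_w + ppn_w),
            (satp >> ppn_w) & ((1 << asid_w) - 1),
            satp & ((1 << ppn_w) - 1))

def write_satp(old, new, supported_modes, sxlen=64):
    """A write selecting an unsupported MODE has no effect on any field."""
    return new if unpack_satp(new, sxlen)[0] in supported_modes else old
```

For a root page table at physical address 0x8000_0000, the PPN is 0x8000_0000 / 4096 = 0x80000, so Sv39 with ASID 1 packs to `(8 << 60) | (1 << 44) | 0x80000`.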


The number of ASID bits is UNSPECIFIED and may be zero. The number of implemented
ASID bits, termed ASIDLEN, may be determined by writing one to every
bit position in the ASID field, then reading back the value in satp to
see which bit positions in the ASID field hold a one. The
least-significant bits of ASID are implemented first: that is, if
ASIDLEN > 0, ASID[ASIDLEN-1:0] is writable. The maximal value of
ASIDLEN, termed ASIDMAX, is 9 for Sv32 or 16 for Sv39, Sv48, and Sv57.
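
The ASIDLEN discovery procedure can be demonstrated against a toy register model. Everything here is a sketch under stated assumptions: `ModelSatp` is a hypothetical stand-in for the real CSR with SXLEN=64 (ASID at bits 59:44), and real software would use CSR read and write instructions instead of method calls.

```python
ASID_SHIFT, ASIDMAX = 44, 16
ASID_FIELD = ((1 << ASIDMAX) - 1) << ASID_SHIFT

class ModelSatp:
    """Toy satp whose ASID field implements only the low `asidlen` bits."""
    def __init__(self, asidlen):
        self.asid_mask = (1 << asidlen) - 1
        self.value = 0

    def write(self, v):
        asid = (v >> ASID_SHIFT) & self.asid_mask   # unimplemented bits read as 0
        self.value = (v & ~ASID_FIELD) | (asid << ASID_SHIFT)

    def read(self):
        return self.value

def probe_asidlen(csr):
    """Write ones to every ASID bit, read back, and count the bits that stuck."""
    saved = csr.read()
    csr.write(saved | ASID_FIELD)
    asid = (csr.read() >> ASID_SHIFT) & ((1 << ASIDMAX) - 1)
    csr.write(saved)                                # restore the original value
    return bin(asid).count("1")
```

Because the least-significant ASID bits are implemented first, counting the ones that read back gives ASIDLEN directly.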

For many applications, the choice of page size has a substantial
performance impact. A large page size increases TLB reach and loosens
the associativity constraints on virtually indexed, physically tagged
caches. At the same time, large pages exacerbate internal fragmentation,
wasting physical memory and possibly cache capacity.
After much deliberation, we have settled on a conventional page size of
4 KiB for both RV32 and RV64. We expect this decision to ease the
porting of low-level runtime software and device drivers. The TLB reach
problem is ameliorated by transparent superpage support in modern
operating systems. Additionally, multi-level TLB hierarchies are quite
inexpensive relative to the multi-level cache hierarchies whose address
space they map.

The satp register is considered active when the effective privilege
mode is S-mode or U-mode. Executions of the address-translation
algorithm may only begin using a given value of satp when satp is
active.

Translations that began while satp was active are not required to
complete or terminate when satp is no longer active, unless an
SFENCE.VMA instruction matching the address and ASID is executed. The
SFENCE.VMA instruction must be used to ensure that updates to the
address-translation data structures are observed by subsequent implicit
reads to those structures by a hart.

Note that writing satp does not imply any ordering constraints between
page-table updates and subsequent address translations, nor does it
imply any invalidation of address-translation caches. If the new address
space’s page tables have been modified, or if an ASID is reused, it may
be necessary to execute an SFENCE.VMA instruction (see
Section 1.2.1) after, or in some cases
before, writing satp.

Not imposing upon implementations to flush address-translation caches
upon satp writes reduces the cost of context switches, provided a
sufficiently large ASID space.

Supervisor Instructions

In addition to the SRET instruction defined in
Section [otherpriv], one new supervisor-level
instruction is provided.

Supervisor Memory-Management Fence Instruction

| Field | funct7 | rs2 | rs1 | funct3 | rd | opcode |
|:--|:-:|:-:|:-:|:-:|:-:|:-:|
| Bits | 31:25 | 24:20 | 19:15 | 14:12 | 11:7 | 6:0 |
| Width | 7 | 5 | 5 | 3 | 5 | 7 |
| Value | SFENCE.VMA | asid | vaddr | PRIV | 0 | SYSTEM |


The supervisor memory-management fence instruction SFENCE.VMA is used to
synchronize updates to in-memory memory-management data structures with
current execution. Instruction execution causes implicit reads and
writes to these data structures; however, these implicit references are
ordinarily not ordered with respect to explicit loads and stores.
Executing an SFENCE.VMA instruction guarantees that any previous stores
already visible to the current RISC-V hart are ordered before certain
implicit references by subsequent instructions in that hart to the
memory-management data structures. The specific set of operations
ordered by SFENCE.VMA is determined by rs1 and rs2, as described
below. SFENCE.VMA is also used to invalidate entries in the
address-translation cache associated with a hart (see
Section 1.3.2). Further details on the
behavior of this instruction are described in
Section [virt-control] and
Section [pmp-vmem].
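As a rough illustration of the field layout above, the following Python sketch packs the six fields into a 32-bit instruction word. The helper name `encode_sfence_vma` is hypothetical. The numeric values for funct7 (0001001), funct3 (000), and the SYSTEM opcode (1110011) come from the standard RISC-V instruction listings rather than from the text of this section, so treat them as assumptions here.

```python
# Field values for SFENCE.VMA in the R-type layout (assumed from the
# standard instruction listings, not stated in this section's text).
FUNCT7_SFENCE_VMA = 0b0001001   # bits 31-25
FUNCT3_PRIV = 0b000             # bits 14-12
OPCODE_SYSTEM = 0b1110011       # bits 6-0


def encode_sfence_vma(rs1: int = 0, rs2: int = 0) -> int:
    """Return the 32-bit instruction word for SFENCE.VMA rs1, rs2.

    rs1 and rs2 are register numbers (0-31), not register contents:
    rs1 names the register holding the virtual address and rs2 the
    register holding the ASID. The rd field (bits 11-7) is always zero.
    """
    if not (0 <= rs1 < 32 and 0 <= rs2 < 32):
        raise ValueError("register numbers must be in 0..31")
    return (
        (FUNCT7_SFENCE_VMA << 25)
        | (rs2 << 20)
        | (rs1 << 15)
        | (FUNCT3_PRIV << 12)
        | (0 << 7)               # rd = x0
        | OPCODE_SYSTEM
    )
```

Under these assumptions, `hex(encode_sfence_vma())` gives the word for the global form (rs1=x0, rs2=x0), and passing nonzero register numbers selects the address-specific or ASID-specific forms described below.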

The SFENCE.VMA is used to flush any local hardware caches related to
address translation. It is specified as a fence rather than a TLB flush
to provide cleaner semantics with respect to which instructions are
affected by the flush operation and to support a wider variety of
dynamic caching structures and memory-management schemes. SFENCE.VMA is
also used by higher privilege levels to synchronize page table writes
and the address translation hardware.

SFENCE.VMA orders only the local hart’s implicit references to the
memory-management data structures.

Consequently, other harts must be notified separately when the
memory-management data structures have been modified. One approach is to
use 1) a local data fence to ensure local writes are visible globally,
then 2) an interprocessor interrupt to the other thread, then 3) a local
SFENCE.VMA in the interrupt handler of the remote thread, and finally 4)
signal back to the originating thread that the operation is complete. This is,
of course, the RISC-V analog to a TLB shootdown.

For the common case that the translation data structures have only been
modified for a single address mapping (i.e., one page or superpage),
rs1 can specify a virtual address within that mapping to effect a
translation fence for that mapping only. Furthermore, for the common
case that the translation data structures have only been modified for a
single address-space identifier, rs2 can specify the address space.
The behavior of SFENCE.VMA depends on rs1 and rs2 as follows:


If rs1=x0 and rs2=x0, the fence orders all reads and writes
made to any level of the page tables, for all address spaces. The
fence also invalidates all address-translation cache entries, for
all address spaces.


If rs1=x0 and rs2≠x0, the fence orders all reads and writes
made to any level of the page tables, but only for the address space
identified by integer register rs2. Accesses to global mappings
(see Section 1.3.1) are not ordered. The
fence also invalidates all address-translation cache entries
matching the address space identified by integer register rs2,
except for entries containing global mappings.


If rs1≠x0 and rs2=x0, the fence orders only reads and writes
made to leaf page table entries corresponding to the virtual address
in rs1, for all address spaces. The fence also invalidates all
address-translation cache entries that contain leaf page table
entries corresponding to the virtual address in rs1, for all
address spaces.


If rs1≠x0 and rs2≠x0, the fence orders only reads and writes
made to leaf page table entries corresponding to the virtual address
in rs1, for the address space identified by integer register
rs2. Accesses to global mappings are not ordered. The fence also
invalidates all address-translation cache entries that contain leaf
page table entries corresponding to the virtual address in rs1 and
that match the address space identified by integer register rs2,
except for entries containing global mappings.


If the value held in rs1 is not a valid virtual address, then the
SFENCE.VMA instruction has no effect. No exception is raised in this
case.
When rs2≠x0, bits SXLEN-1:ASIDMAX of the value held in rs2 are
reserved for future standard use. Until their use is defined by a
standard extension, they should be zeroed by software and ignored by
current implementations. Furthermore, if ASIDLEN < ASIDMAX, the
implementation shall ignore bits ASIDMAX-1:ASIDLEN of the value held in
rs2.

It is always legal to over-fence, e.g., by fencing only based on a
subset of the bits in rs1 and/or rs2, and/or by simply treating all
SFENCE.VMA instructions as having rs1=x0 and/or rs2=x0. For
example, simpler implementations can ignore the virtual address in rs1
and the ASID value in rs2 and always perform a global fence. The
choice not to raise an exception when an invalid virtual address is held
in rs1 facilitates this type of simplification.
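The four rs1/rs2 cases can be summarized with a small software model of an address-translation cache. This is only a sketch under simplifying assumptions: the cache holds 4 KiB leaf translations only, `TlbEntry` and `sfence_vma` are hypothetical names, and `None` stands for "the register operand is x0". A real implementation may invalidate more than this (over-fencing is always legal).

```python
from dataclasses import dataclass
from typing import List, Optional

PAGE_SHIFT = 12  # 4 KiB base pages


@dataclass(frozen=True)
class TlbEntry:
    vpn: int         # virtual page number of a cached leaf translation
    asid: int        # address space the entry was created for
    is_global: bool  # True if the translation is a global mapping


def sfence_vma(entries: List[TlbEntry],
               vaddr: Optional[int] = None,
               asid: Optional[int] = None) -> List[TlbEntry]:
    """Return the entries that survive an SFENCE.VMA.

    vaddr=None models rs1=x0 (all addresses); asid=None models rs2=x0
    (all address spaces, including global mappings).
    """
    def invalidated(e: TlbEntry) -> bool:
        if vaddr is not None and e.vpn != (vaddr >> PAGE_SHIFT):
            return False  # fence is for a different page
        if asid is not None and (e.is_global or e.asid != asid):
            return False  # other ASIDs and global mappings are kept
        return True

    return [e for e in entries if not invalidated(e)]
```

Note that an ASID of 0 passed explicitly (`asid=0`) is not the same as `asid=None`: the former corresponds to a non-x0 register that happens to contain zero, and so leaves global mappings in place.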

An implicit read of the memory-management data structures may return any
translation for an address that was valid at any time since the most
recent SFENCE.VMA that subsumes that address. The ordering implied by
SFENCE.VMA does not place implicit reads and writes to the
memory-management data structures into the global memory order in a way
that interacts cleanly with the standard RVWMO ordering rules. In
particular, even though an SFENCE.VMA orders prior explicit accesses
before subsequent implicit accesses, and those implicit accesses are
ordered before their associated explicit accesses, SFENCE.VMA does not
necessarily place prior explicit accesses before subsequent explicit
accesses in the global memory order. These implicit loads also need not
otherwise obey normal program order semantics with respect to prior
loads or stores to the same address.

A consequence of this specification is that an implementation may use
any translation for an address that was valid at any time since the most
recent SFENCE.VMA that subsumes that address. In particular, if a leaf
PTE is modified but a subsuming SFENCE.VMA is not executed, either the
old translation or the new translation will be used, but the choice is
unpredictable. The behavior is otherwise well-defined.
In a conventional TLB design, it is possible for multiple entries to
match a single address if, for example, a page is upgraded to a
superpage without first clearing the original non-leaf PTE’s valid bit
and executing an SFENCE.VMA with rs1=x0. In this case, a similar
remark applies: it is unpredictable whether the old non-leaf PTE or the
new leaf PTE is used, but the behavior is otherwise well defined.
Another consequence of this specification is that it is generally unsafe
to update a PTE using a set of stores of a width less than the width of
the PTE, as it is legal for the implementation to read the PTE at any
time, including when only some of the partial stores have taken effect.


This specification permits the caching of PTEs whose V (Valid) bit is
clear. Operating systems must be written to cope with this possibility,
but implementers are reminded that eagerly caching invalid PTEs will
reduce performance by causing additional page faults.

Implementations must only perform implicit reads of the translation data
structures pointed to by the current contents of the satp register or
a subsequent valid (V=1) translation data structure entry, and must only
raise exceptions for implicit accesses that are generated as a result of
instruction execution, not those that are performed speculatively.
Changes to the sstatus fields SUM and MXR take effect immediately,
without the need to execute an SFENCE.VMA instruction. Changing
satp.MODE from Bare to other modes and vice versa also takes effect
immediately, without the need to execute an SFENCE.VMA instruction.
Likewise, changes to satp.ASID take effect immediately.

The following common situations typically require executing an
SFENCE.VMA instruction:


When software recycles an ASID (i.e., reassociates it with a
different page table), it should first change satp to point to
the new page table using the recycled ASID, then execute
SFENCE.VMA with rs1=x0 and rs2 set to the recycled ASID.
Alternatively, software can execute the same SFENCE.VMA instruction
while a different ASID is loaded into satp, provided the next time
satp is loaded with the recycled ASID, it is simultaneously loaded
with the new page table.


If the implementation does not provide ASIDs, or software chooses to
always use ASID 0, then after every satp write, software should
execute SFENCE.VMA with rs1=x0. In the common case that no
global translations have been modified, rs2 should be set to a
register other than x0 but which contains the value zero, so that
global translations are not flushed.


If software modifies a non-leaf PTE, it should execute SFENCE.VMA
with rs1=x0. If any PTE along the traversal path had its G bit
set, rs2 must be x0; otherwise, rs2 should be set to the ASID
for which the translation is being modified.


If software modifies a leaf PTE, it should execute SFENCE.VMA with
rs1 set to a virtual address within the page. If any PTE along the
traversal path had its G bit set, rs2 must be x0; otherwise,
rs2 should be set to the ASID for which the translation is being
modified.


For the special cases of increasing the permissions on a leaf PTE
and changing an invalid PTE to a valid leaf, software may choose to
execute the SFENCE.VMA lazily. After modifying the PTE but before
executing SFENCE.VMA, either the new or old permissions will be
used. In the latter case, a page-fault exception might occur, at
which point software should execute SFENCE.VMA in accordance with
the previous bullet point.
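The operand choices in the situations above can be condensed into a small decision helper. The sketch below is illustrative only: `Fence`, `fence_for_pte_update`, and `fence_for_asid_recycle` are hypothetical names, and `None` again stands for using x0 as the register operand.

```python
from typing import NamedTuple, Optional


class Fence(NamedTuple):
    vaddr: Optional[int]  # None means rs1=x0 (all addresses)
    asid: Optional[int]   # None means rs2=x0 (all address spaces)


def fence_for_pte_update(is_leaf: bool, vaddr: int, asid: int,
                         global_on_path: bool) -> Fence:
    """Pick SFENCE.VMA operands after modifying one PTE.

    A non-leaf update needs rs1=x0; a leaf update can name an address
    within the page. If any PTE on the traversal path had G set,
    rs2 must be x0.
    """
    return Fence(vaddr=vaddr if is_leaf else None,
                 asid=None if global_on_path else asid)


def fence_for_asid_recycle(asid: int) -> Fence:
    """Pick SFENCE.VMA operands when an ASID is reassociated."""
    return Fence(vaddr=None, asid=asid)
```

For software that always uses ASID 0, `fence_for_asid_recycle(0)` yields `Fence(None, 0)`, which corresponds to the recommended form where rs2 is a non-x0 register containing zero so that global translations are not flushed.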


If a hart employs an address-translation cache, that cache must appear
to be private to that hart. In particular, the meaning of an ASID is
local to a hart; software may choose to use the same ASID to refer to
different address spaces on different harts.

A future extension could redefine ASIDs to be global across the SEE,
enabling such options as shared translation caches and hardware support
for broadcast TLB shootdown. However, as OSes have evolved to
significantly reduce the scope of TLB shootdowns using novel
ASID-management techniques, we expect the local-ASID scheme to remain
attractive for its simplicity and possibly better scalability.

For implementations that make satp.MODE read-only zero (always Bare),
attempts to execute an SFENCE.VMA instruction might raise an illegal
instruction exception.

Sv32: Page-Based 32-bit Virtual-Memory Systems

When Sv32 is written to the MODE field in the satp register (see
Section 1.1.11), the supervisor operates in a
32-bit paged virtual-memory system. In this mode, supervisor and user
virtual addresses are translated into supervisor physical addresses by
traversing a radix-tree page table. Sv32 is supported when SXLEN=32 and
is designed to include mechanisms sufficient for supporting modern
Unix-based operating systems.

The initial RISC-V paged virtual-memory architectures have been designed
as straightforward implementations to support existing operating
systems. We have architected page table layouts to support a hardware
page-table walker. Software TLB refills are a performance bottleneck on
high-performance systems, and are especially troublesome with decoupled
specialized coprocessors. An implementation can choose to implement
software TLB refills using a machine-mode trap handler as an extension
to M-mode.


Some ISAs architecturally expose virtually indexed, physically tagged
caches, in that accesses to the same physical address via different
virtual addresses might not be coherent unless the virtual addresses lie
within the same cache set. Implicitly, this specification does not
permit such behavior to be architecturally exposed.

Addressing and Memory Protection

Sv32 implementations support a 32-bit virtual address space, divided
into pages. An Sv32 virtual address is partitioned into a virtual page
number (VPN) and page offset, as shown in
Figure [sv32va]. When Sv32 virtual memory mode is
selected in the MODE field of the satp register, supervisor virtual
addresses are translated into supervisor physical addresses via a
two-level page table. The 20-bit VPN is translated into a 22-bit
physical page number (PPN), while the 12-bit page offset is
untranslated. The resulting supervisor-level physical addresses are then
checked using any physical memory protection structures
(Section [sec:pmp]), before being directly converted
to machine-level physical addresses. If necessary, supervisor-level
physical addresses are zero-extended to the number of physical address
bits found in the implementation.

For example, consider an RV32 system supporting 34 bits of physical
address. When the value of satp.MODE is Sv32, a 34-bit physical
address is produced directly, and therefore no zero-extension is needed.
When the value of satp.MODE is Bare, the 32-bit virtual address is
translated (unmodified) into a 32-bit physical address, and then that
physical address is zero-extended into a 34-bit machine-level physical
address.
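To make the address arithmetic concrete, here is a minimal Python sketch of how an Sv32 virtual address splits into its fields and how the two cases in the example produce a physical address. The function names are hypothetical; the field widths (10-bit VPN[1], 10-bit VPN[0], 12-bit offset, 22-bit PPN) follow the description above.

```python
def split_sv32_va(va: int):
    """Split a 32-bit Sv32 virtual address into (VPN[1], VPN[0], offset)."""
    if not 0 <= va < (1 << 32):
        raise ValueError("Sv32 virtual addresses are 32 bits wide")
    return (va >> 22) & 0x3FF, (va >> 12) & 0x3FF, va & 0xFFF


def sv32_physical_address(ppn: int, offset: int) -> int:
    """Join a 22-bit PPN and a 12-bit offset into a 34-bit address."""
    if not (0 <= ppn < (1 << 22) and 0 <= offset < (1 << 12)):
        raise ValueError("PPN is 22 bits and the page offset is 12 bits")
    return (ppn << 12) | offset


def bare_physical_address(va: int) -> int:
    """In Bare mode the 32-bit address passes through unmodified.

    Zero-extension to the machine's physical address width (e.g. 34
    bits) leaves the numeric value unchanged.
    """
    return va & 0xFFFF_FFFF
```

With the largest PPN and offset, `sv32_physical_address` produces a value that needs all 34 bits, whereas a Bare-mode address never exceeds 32 bits.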


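| Bits | 31–22 | 21–12 | 11–0 |
|:- |:- |:- |:- |
| Field | VPN[1] | VPN[0] | page offset |
| Width | 10 | 10 | 12 |

Sv32 virtual address.

| Bits | 33–22 | 21–12 | 11–0 |
|:- |:- |:- |:- |
| Field | PPN[1] | PPN[0] | page offset |
| Width | 12 | 10 | 12 |

Sv32 physical address.

| Bits | 31–20 | 19–10 | 9–8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |
| Field | PPN[1] | PPN[0] | RSW | D | A | G | U | X | W | R | V |
| Width | 12 | 10 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

Sv32 page table entry.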


Sv32 page tables consist of 2¹⁰ page-table entries (PTEs),
each of four bytes. A page table is exactly the size of a page and must
always be aligned to a page boundary. The physical page number of the
root page table is stored in the satp register.
The PTE format for Sv32 is shown in
Figure [sv32pte]. The V bit indicates whether the
PTE is valid; if it is 0, all other bits in the PTE are don’t-cares and
may be used freely by software. The permission bits, R, W, and X,
indicate whether the page is readable, writable, and executable,
respectively. When all three are zero, the PTE is a pointer to the next
level of the page table; otherwise, it is a leaf PTE. Writable pages
must also be marked readable; the contrary combinations are reserved for
future use. Table [pteperm] summarizes the encoding of the
permission bits.


| X | W | R | Meaning |
|:- |:- |:- |:- |
| 0 | 0 | 0 | Pointer to next level of page table. |
| 0 | 0 | 1 | Read-only page. |
| 0 | 1 | 0 | Reserved for future use. |
| 0 | 1 | 1 | Read-write page. |
| 1 | 0 | 0 | Execute-only page. |
| 1 | 0 | 1 | Read-execute page. |
| 1 | 1 | 0 | Reserved for future use. |
| 1 | 1 | 1 | Read-write-execute page. |


Attempting to fetch an instruction from a page that does not have
execute permissions raises a fetch page-fault exception. Attempting to
execute a load or load-reserved instruction whose effective address lies
within a page without read permissions raises a load page-fault
exception. Attempting to execute a store, store-conditional, or AMO
instruction whose effective address lies within a page without write
permissions raises a store page-fault exception.

AMOs never raise load page-fault exceptions. Since any unreadable page
is also unwritable, attempting to perform an AMO on an unreadable page
always raises a store page-fault exception.
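The permission encoding and the resulting page faults can be sketched in a few lines of Python. This is a simplified model, not the full check: it looks only at the R, W, and X bits of a leaf PTE and ignores V, U, SUM, MXR, and the A/D bits. The names `classify_pte` and `page_fault_for` are hypothetical.

```python
from typing import Optional

PTE_R, PTE_W, PTE_X = 1 << 1, 1 << 2, 1 << 3  # PTE bits 1, 2 and 3

# Keyed by (X, W, R), in the same column order as the table above.
MEANING = {
    (0, 0, 0): "pointer to next level of page table",
    (0, 0, 1): "read-only page",
    (0, 1, 0): "reserved",
    (0, 1, 1): "read-write page",
    (1, 0, 0): "execute-only page",
    (1, 0, 1): "read-execute page",
    (1, 1, 0): "reserved",
    (1, 1, 1): "read-write-execute page",
}


def classify_pte(pte: int) -> str:
    """Describe a PTE according to its X, W and R bits."""
    key = (int(bool(pte & PTE_X)), int(bool(pte & PTE_W)),
           int(bool(pte & PTE_R)))
    return MEANING[key]


def page_fault_for(access: str, pte: int) -> Optional[str]:
    """Return the page fault raised by an access to a leaf PTE, if any.

    access is 'fetch', 'load' (including load-reserved) or 'store'
    (including store-conditional and AMOs).
    """
    needed = {"fetch": PTE_X, "load": PTE_R, "store": PTE_W}[access]
    return None if pte & needed else f"{access} page fault"
```

Because AMOs are checked as stores, an AMO to a page without write permission reports a store page fault in this model, matching the note above.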

The U bit indicates whether the page is accessible to user mode. U-mode
software may only access the page when U=1. If the SUM bit in the
sstatus register is set, supervisor mode software may also access
pages with U=1. However, supervisor code normally operates with the SUM
bit clear, in which case, supervisor code will fault on accesses to
user-mode pages. Irrespective of SUM, the supervisor may not execute
code on pages with U=1.

An alternative PTE format would support different permissions for
supervisor and user. We omitted this feature because it would be largely
redundant with the SUM mechanism (see
Section 1.1.1.2) and would require more encoding
space in the PTE.

The G bit designates a global mapping. Global mappings are those that
exist in all address spaces. For non-leaf PTEs, the global setting
implies that all mappings in the subsequent levels of the page table are
global. Note that failing to mark a global mapping as global merely
reduces performance, whereas marking a non-global mapping as global is a
software bug that, after switching to an address space with a different
non-global mapping for that address range, can unpredictably result in
either mapping being used.

Global mappings need not be stored redundantly in address-translation
caches for multiple ASIDs. Additionally, they need not be flushed from
local address-translation caches when an SFENCE.VMA instruction is
executed with rs2≠x0.

The RSW field is reserved for use by supervisor software; the
implementation shall ignore this field.
Each leaf PTE contains an accessed (A) and dirty (D) bit. The A bit
indicates the virtual page has been read, written, or fetched from since
the last time the A bit was cleared. The D bit indicates the virtual
page has been written since the last time the D bit was cleared.
Two schemes to manage the A and D bits are permitted:


When a virtual page is accessed and the A bit is clear, or is
written and the D bit is clear, a page-fault exception is raised.


When a virtual page is accessed and the A bit is clear, or is
written and the D bit is clear, the implementation sets the
corresponding bit(s) in the PTE. The PTE update must be atomic with
respect to other accesses to the PTE, and must atomically check that
the PTE is valid and grants sufficient permissions. Updates of the A
bit may be performed as a result of speculation, but updates to the
D bit must be exact (i.e., not speculative), and observed in program
order by the local hart. Furthermore, the PTE update must appear in
the global memory order no later than the explicit memory access, or
any subsequent explicit memory access to that virtual page by the
local hart. The ordering on loads and stores provided by FENCE
instructions and the acquire/release bits on atomic instructions
also orders the PTE updates associated with those loads and stores
as observed by remote harts.
The PTE update is not required to be atomic with respect to the
explicit memory access that caused the update, and the sequence is
interruptible. However, the hart must not perform the explicit
memory access before the PTE update is globally visible.


All harts in a system must employ the same PTE-update scheme as each
other.

Prior versions of this specification required PTE A bit updates to be
exact, but allowing the A bit to be updated as a result of speculation
simplifies the implementation of address translation prefetchers. System
software typically uses the A bit as a page replacement policy hint, but
does not require exactness for functional correctness. On the other
hand, D bit updates are still required to be exact and performed in
program order, as the D bit affects the functional correctness of page
eviction.
Implementations are of course still permitted to perform both A and D
bit updates only in an exact manner.
In both cases, requiring atomicity ensures that the PTE update will not
be interrupted by other intervening writes to the page table, as such
interruptions could lead to A/D bits being set on PTEs that have been
reused for other purposes, on memory that has been reclaimed for other
purposes, and so on. Simple implementations may instead generate
page-fault exceptions.
The A and D bits are never cleared by the implementation. If the
supervisor software does not rely on accessed and/or dirty bits, e.g. if
it does not swap memory pages to secondary storage or if the pages are
being used to map I/O space, it should always set them to 1 in the PTE
to improve performance.

Any level of PTE may be a leaf PTE, so in addition to 4 KiB pages, Sv32
supports 4 MiB megapages. A megapage must be virtually and physically
aligned to a 4 MiB boundary; a page-fault exception is raised if the
physical address is insufficiently aligned.
For non-leaf PTEs, the D, A, and U bits are reserved for future standard
use. Until their use is defined by a standard extension, they must be
cleared by software for forward compatibility.
For implementations with both page-based virtual memory and the “A”
standard extension, the LR/SC reservation set must lie completely within
a single base page (i.e., a naturally aligned region).

Virtual Address Translation Process

A virtual address va is translated into a physical address pa as
follows:

1. Let a be satp.ppn × PAGESIZE, and let i = LEVELS − 1. (For Sv32,
   PAGESIZE=2¹² and LEVELS=2.) The satp register must be active,
   i.e., the effective privilege mode must be S-mode or U-mode.

2. Let pte be the value of the PTE at address
   a + va.vpn[i] × PTESIZE. (For Sv32, PTESIZE=4.) If accessing pte
   violates a PMA or PMP check, raise an access-fault exception
   corresponding to the original access type.

3. If pte.v = 0, or if pte.r = 0 and pte.w = 1, or if any bits or
   encodings that are reserved for future standard use are set within
   pte, stop and raise a page-fault exception corresponding to the
   original access type.

4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to
   step 5. Otherwise, this PTE is a pointer to the next level of the
   page table. Let i = i − 1. If i < 0, stop and raise a page-fault
   exception corresponding to the original access type. Otherwise,
   let a = pte.ppn × PAGESIZE and go to step 2.

5. A leaf PTE has been found. Determine if the requested memory access
   is allowed by the pte.r, pte.w, pte.x, and pte.u bits, given the
   current privilege mode and the value of the SUM and MXR fields of
   the mstatus register. If not, stop and raise a page-fault exception
   corresponding to the original access type.

6. If i > 0 and pte.ppn[i−1:0] ≠ 0, this is a misaligned superpage;
   stop and raise a page-fault exception corresponding to the original
   access type.

7. If pte.a = 0, or if the original memory access is a store and
   pte.d = 0, either raise a page-fault exception corresponding to the
   original access type, or:

   - If a store to pte would violate a PMA or PMP check, raise an
     access-fault exception corresponding to the original access type.

   - Perform the following steps atomically:

     - Compare pte to the value of the PTE at address
       a + va.vpn[i] × PTESIZE.

     - If the values match, set pte.a to 1 and, if the original memory
       access is a store, also set pte.d to 1.

     - If the comparison fails, return to step 2.

8. The translation is successful. The translated physical address is
   given as follows:

   - pa.pgoff = va.pgoff.

   - If i > 0, then this is a superpage translation and
     pa.ppn[i−1:0] = va.vpn[i−1:0].

   - pa.ppn[LEVELS−1:i] = pte.ppn[LEVELS−1:i].


All implicit accesses to the address-translation data structures in this
algorithm are performed using width PTESIZE.

This implies, for example, that an Sv48 implementation may not use two
separate 4B reads to non-atomically access a single 8B PTE, and that A/D
bit updates performed by the implementation are treated as atomically
updating the entire PTE, rather than just the A and/or D bit alone (even
though the PTE value does not otherwise change).
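The translation algorithm can be sketched as a short Python function. This is a simplified software model rather than a reference implementation: memory is a hypothetical dictionary `mem` mapping PTE addresses to integer values, PMA/PMP checks and reserved high PTE bits are not modeled, the caller is assumed to pass an in-range (canonical) address, and the atomic compare in the A/D update trivially succeeds because the model is single-threaded. The parameter names (`sum_bit`, `mxr`, `hw_ad`) are illustrative.

```python
PAGESIZE = 1 << 12
V, R, W, X, U, G, A, D = (1 << n for n in range(8))  # PTE flag bits 0..7
PPN_MASK = (1 << 44) - 1  # covers Sv32's 22-bit and RV64's 44-bit PPN


class PageFault(Exception):
    """Carries the original access type: 'fetch', 'load' or 'store'."""


def translate(mem, satp_ppn, va, access, *, levels=2, ptesize=4,
              vpn_bits=10, mode="S", sum_bit=False, mxr=False,
              hw_ad=False):
    """Translate va using page tables in mem (dict: address -> PTE)."""
    vpn = [(va >> (12 + vpn_bits * k)) & ((1 << vpn_bits) - 1)
           for k in range(levels)]
    a, i = satp_ppn * PAGESIZE, levels - 1             # walk start
    while True:
        pte_addr = a + vpn[i] * ptesize                # locate the PTE
        pte = mem.get(pte_addr, 0)
        if not pte & V or (pte & W and not pte & R):   # invalid/reserved
            raise PageFault(access)
        if pte & (R | X):                              # leaf PTE found
            break
        i -= 1                                         # descend a level
        if i < 0:
            raise PageFault(access)
        a = ((pte >> 10) & PPN_MASK) * PAGESIZE
    # Permission check against R/W/X/U, privilege mode, SUM and MXR.
    allowed = {"fetch": pte & X,
               "load": pte & R or (mxr and pte & X),
               "store": pte & W}[access]
    if mode == "U":
        allowed = allowed and pte & U
    elif pte & U:                  # S-mode access to a user page
        allowed = allowed and sum_bit and access != "fetch"
    if not allowed:
        raise PageFault(access)
    # Misaligned superpage check: low PPN fields must be zero.
    ppn = (pte >> 10) & PPN_MASK
    low_mask = (1 << (vpn_bits * i)) - 1
    if ppn & low_mask:
        raise PageFault(access)
    # Accessed/dirty handling: fault, or update the PTE in memory.
    need = A | (D if access == "store" else 0)
    if (pte & need) != need:
        if not hw_ad:
            raise PageFault(access)
        mem[pte_addr] = pte | need
    # Compose the physical address; superpages take low VPN bits.
    pa_ppn = (ppn & ~low_mask) | ((va >> 12) & low_mask)
    return (pa_ppn << 12) | (va & 0xFFF)
```

The defaults correspond to Sv32 (LEVELS=2, PTESIZE=4, 10-bit VPN fields). Passing `levels=3, ptesize=8, vpn_bits=9` would model the Sv39 walk under the same simplifications.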

The results of implicit address-translation reads in step 2 may be held
in a read-only, incoherent address-translation cache but not shared
with other harts. The address-translation cache may hold an arbitrary
number of entries, including an arbitrary number of entries for the same
address and ASID. Entries in the address-translation cache may then
satisfy subsequent step 2 reads if the ASID associated with the entry
matches the ASID in satp (read in step 1) or if the entry is associated with a
global mapping. To ensure that implicit reads observe writes to the
same memory locations, an SFENCE.VMA instruction must be executed after
the writes to flush the relevant cached translations.
The address-translation cache cannot be used in step 7; accessed and
dirty bits may only be updated in memory directly.

It is permitted for multiple address-translation cache entries to
co-exist for the same address. This represents the fact that in a
conventional TLB hierarchy, it is possible for multiple entries to match
a single address if, for example, a page is upgraded to a superpage
without first clearing the original non-leaf PTE’s valid bit and
executing an SFENCE.VMA with rs1=x0, or if multiple TLBs exist in
parallel at a given level of the hierarchy. In this case, just as if an
SFENCE.VMA is not executed between a write to the memory-management
tables and subsequent implicit read of the same address: it is
unpredictable whether the old non-leaf PTE or the new leaf PTE is used,
but the behavior is otherwise well defined.

Implementations may also execute the address-translation algorithm
speculatively at any time, for any virtual address, as long as satp is
active (as defined in
Section 1.1.11). Such speculative executions have
the effect of pre-populating the address-translation cache.
Speculative executions of the address-translation algorithm behave as
non-speculative executions of the algorithm do, except that they must
not set the dirty bit for a PTE, they must not trigger an exception, and
they must not create address-translation cache entries if those entries
would have been invalidated by any SFENCE.VMA instruction executed by
the hart since the speculative execution of the algorithm began.

For instance, it is illegal for both non-speculative and speculative
executions of the translation algorithm to begin, read the level 2 page
table, pause while the hart executes an SFENCE.VMA with
rs1=rs2=x0, then resume using the now-stale level 2 PTE, as
subsequent implicit reads could populate the address-translation cache
with stale PTEs.
In many implementations, an SFENCE.VMA instruction with rs1=x0 will
therefore either terminate all previously-launched speculative
executions of the address-translation algorithm (for the specified ASID,
if applicable), or simply wait for them to complete (in which case any
address-translation cache entries created will be invalidated by the
SFENCE.VMA as appropriate). Likewise, an SFENCE.VMA instruction with
rs1≠x0 generally must either ensure that previously-launched
speculative executions of the address-translation algorithm (for the
specified ASID, if applicable) are prevented from creating new
address-translation cache entries mapping leaf PTEs, or wait for them to
complete.
A consequence of implementations being permitted to read the translation
data structures arbitrarily early and speculatively is that at any time,
all page table entries reachable by executing the algorithm may be
loaded into the address-translation cache.
Although it would be uncommon to place page tables in non-idempotent
memory, there is no explicit prohibition against doing so. Since the
algorithm may only touch page tables reachable from the root page table
indicated in satp, the range of addresses that an implementation’s
page table walker will touch is fully under supervisor control.


The algorithm does not admit the possibility of ignoring high-order PPN
bits for implementations with narrower physical addresses.

Sv39: Page-Based 39-bit Virtual-Memory System

This section describes a simple paged virtual-memory system for
SXLEN=64, which supports 39-bit virtual address spaces. The design of
Sv39 follows the overall scheme of Sv32, and this section details only
the differences between the schemes.

We specified multiple virtual memory systems for RV64 to relieve the
tension between providing a large address space and minimizing
address-translation cost. For many systems, 512 GiB of virtual-address
space is ample, and so Sv39 suffices. Sv48 increases the virtual address
space to 256 TiB, but increases the physical memory capacity dedicated to page tables,
the latency of page-table traversals, and the size of hardware
structures that store virtual addresses. Sv57 increases the virtual
address space, page table capacity requirement, and translation latency
even further.
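The trade-off can be made concrete with a little arithmetic. The sketch below assumes the RV64 layout described in the following sections (a 12-bit page offset plus one 9-bit VPN field per page-table level); the dictionary `LEVELS` and the helper names are illustrative, not part of the specification.

```python
PAGE_OFFSET_BITS = 12
VPN_FIELD_BITS = 9          # each RV64 page-table level indexes 2**9 PTEs
LEVELS = {"Sv39": 3, "Sv48": 4, "Sv57": 5}


def virtual_address_bits(levels: int) -> int:
    """Width of the virtual address for a given number of levels."""
    return PAGE_OFFSET_BITS + VPN_FIELD_BITS * levels


def page_sizes(levels: int):
    """Sizes in bytes of the leaf pages available at each level."""
    return [1 << (PAGE_OFFSET_BITS + VPN_FIELD_BITS * k)
            for k in range(levels)]


for name, levels in LEVELS.items():
    bits = virtual_address_bits(levels)
    print(f"{name}: {bits}-bit virtual addresses, {levels} levels, "
          f"page sizes {page_sizes(levels)} bytes")
```

Each extra level multiplies the virtual address space by 2⁹ but adds one more memory access to a worst-case page-table walk.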

Addressing and Memory Protection

Sv39 implementations support a 39-bit virtual address space, divided
into pages. An Sv39 address is partitioned as shown in
Figure [sv39va]. Instruction fetch addresses and
load and store effective addresses, which are 64 bits, must have bits
63–39 all equal to bit 38, or else a page-fault exception will occur.
The 27-bit VPN is translated into a 44-bit PPN via a three-level page
table, while the 12-bit page offset is untranslated.
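The requirement on the upper address bits amounts to a sign-extension check. A minimal sketch, assuming 64-bit addresses held as non-negative Python integers and a hypothetical helper name `is_canonical`:

```python
def is_canonical(va: int, va_bits: int = 39) -> bool:
    """True if bits 63..va_bits of va all equal bit va_bits-1.

    For Sv39 (va_bits=39) this checks that bits 63-39 equal bit 38.
    """
    if not 0 <= va < (1 << 64):
        raise ValueError("expected a 64-bit address")
    upper = va >> (va_bits - 1)                  # bits 63 .. va_bits-1
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return upper == 0 or upper == all_ones
```

An address that fails this check would raise a page-fault exception before any page-table walk takes place. The same helper covers the wider schemes by passing `va_bits=48` or `va_bits=57`.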

When mapping between narrower and wider addresses, RISC-V zero-extends a
narrower physical address to a wider size. The mapping between 64-bit
virtual addresses and the 39-bit usable address space of Sv39 is not
based on zero-extension but instead follows an entrenched convention
that allows an OS to use one or a few of the most-significant bits of a
full-size (64-bit) virtual address to quickly distinguish user and
supervisor address regions.


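| Bits | 38–30 | 29–21 | 20–12 | 11–0 |
|:- |:- |:- |:- |:- |
| Field | VPN[2] | VPN[1] | VPN[0] | page offset |
| Width | 9 | 9 | 9 | 12 |

Sv39 virtual address.

| Bits | 55–30 | 29–21 | 20–12 | 11–0 |
|:- |:- |:- |:- |:- |
| Field | PPN[2] | PPN[1] | PPN[0] | page offset |
| Width | 26 | 9 | 9 | 12 |

Sv39 physical address.

| Bits | 63 | 62–61 | 60–54 | 53–28 | 27–19 | 18–10 | 9–8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |
| Field | N | PBMT | Reserved | PPN[2] | PPN[1] | PPN[0] | RSW | D | A | G | U | X | W | R | V |
| Width | 1 | 2 | 7 | 26 | 9 | 9 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

Sv39 page table entry.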


Sv39 page tables contain 2⁹ page table entries (PTEs), eight
bytes each. A page table is exactly the size of a page and must always
be aligned to a page boundary. The physical page number of the root page
table is stored in the satp register’s PPN field.
The PTE format for Sv39 is shown in
Figure [sv39pte]. Bits 9–0 have the same meaning
as for Sv32. Bit 63 is reserved for use by the Svnapot extension in
Chapter 2. If Svnapot is not implemented, bit 63
remains reserved and must be zeroed by software for forward
compatibility, or else a page-fault exception is raised. Bits 62–61 are
reserved for use by the Svpbmt extension in
Chapter 3. If Svpbmt is not implemented, bits 62–61
remain reserved and must be zeroed by software for forward
compatibility, or else a page-fault exception is raised. Bits 60–54 are
reserved for future standard use and, until their use is defined by some
standard extension, must be zeroed by software for forward
compatibility. If any of these bits are set, a page-fault exception is
raised.
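The reserved-bit rule for Sv39 PTEs can be sketched as a mask check. This is a simplified model with hypothetical names: it only tests whether bits are set that are reserved given the implemented extensions, and it does not model any encodings that Svnapot or Svpbmt themselves may leave reserved.

```python
RESERVED_60_54 = 0x7F << 54   # always reserved for future standard use
PBMT_62_61 = 0b11 << 61       # reserved unless Svpbmt is implemented
N_63 = 1 << 63                # reserved unless Svnapot is implemented


def sv39_reserved_bits_set(pte: int, *, svnapot: bool = False,
                           svpbmt: bool = False) -> bool:
    """True if the PTE sets a high bit that triggers a page fault."""
    reserved = RESERVED_60_54
    if not svpbmt:
        reserved |= PBMT_62_61
    if not svnapot:
        reserved |= N_63
    return bool(pte & reserved)


def sv39_ppn(pte: int) -> int:
    """Extract the 44-bit PPN held in PTE bits 53-10."""
    return (pte >> 10) & ((1 << 44) - 1)
```

A translation walker would apply such a check to each valid PTE it reads and raise a page-fault exception corresponding to the original access type when it returns true.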

We reserved several PTE bits for a possible extension that improves
support for sparse address spaces by allowing page-table levels to be
skipped, reducing memory usage and TLB refill latency. These reserved
bits may also be used to facilitate research experimentation. The cost
is reducing the physical address space, but 64 PiB is presently ample. When it
no longer suffices, the reserved bits that remain unallocated could be
used to expand the physical address space.

Any level of PTE may be a leaf PTE, so in addition to 4 KiB pages, Sv39
supports 2 MiB megapages and 1 GiB gigapages, each of which must be virtually
and physically aligned to a boundary equal to its size. A page-fault
exception is raised if the physical address is insufficiently aligned.
The algorithm for virtual-to-physical address translation is the same as
in Section 1.3.2, except LEVELS equals 3 and
PTESIZE equals 8.
Sv48: Page-Based 48-bit Virtual-Memory System

This section describes a simple paged virtual-memory system for
SXLEN=64, which supports 48-bit virtual address spaces. Sv48 is intended
for systems for which a 39-bit virtual address space is insufficient. It
closely follows the design of Sv39, simply adding an additional level of
page table, and so this chapter only details the differences between the
two schemes.
Implementations that support Sv48 must also support Sv39.

Systems that support Sv48 can also support Sv39 at essentially no cost,
and so should do so to maintain compatibility with supervisor software
that assumes Sv39.

Addressing and Memory Protection

Sv48 implementations support a 48-bit virtual address space, divided
into pages. An Sv48 address is partitioned as shown in
Figure [sv48va]. Instruction fetch addresses and
load and store effective addresses, which are 64 bits, must have bits
63–48 all equal to bit 47, or else a page-fault exception will occur.
The 36-bit VPN is translated into a 44-bit PPN via a four-level page
table, while the 12-bit page offset is untranslated.


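| Bits | 47–39 | 38–30 | 29–21 | 20–12 | 11–0 |
|:- |:- |:- |:- |:- |:- |
| Field | VPN[3] | VPN[2] | VPN[1] | VPN[0] | page offset |
| Width | 9 | 9 | 9 | 9 | 12 |

Sv48 virtual address.

| Bits | 55–39 | 38–30 | 29–21 | 20–12 | 11–0 |
|:- |:- |:- |:- |:- |:- |
| Field | PPN[3] | PPN[2] | PPN[1] | PPN[0] | page offset |
| Width | 17 | 9 | 9 | 9 | 12 |

Sv48 physical address.

| Bits | 63 | 62–61 | 60–54 | 53–37 | 36–28 | 27–19 | 18–10 | 9–8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |
| Field | N | PBMT | Reserved | PPN[3] | PPN[2] | PPN[1] | PPN[0] | RSW | D | A | G | U | X | W | R | V |
| Width | 1 | 2 | 7 | 17 | 9 | 9 | 9 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

Sv48 page table entry.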


The PTE format for Sv48 is shown in
Figure [sv48pte]. Bits 63–54 and 9–0 have the same
meaning as for Sv39. Any level of PTE may be a leaf PTE, so in addition
to pages, Sv48 supports megapages, gigapages, and terapages, each
of which must be virtually and physically aligned to a boundary equal to
its size. A page-fault exception is raised if the physical address is
insufficiently aligned.
The algorithm for virtual-to-physical address translation is the same as
in Section 1.3.2, except LEVELS equals 4 and
PTESIZE equals 8.
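
The following is a simplified sketch of a page-table walk with LEVELS = 4 and PTESIZE = 8. It is not the normative algorithm of Section 1.3.2: it assumes the usual structure of that walk and deliberately omits permission checks, A/D-bit handling, PMA/PMP checks, reserved-bit checks, and the canonical-address check. Physical memory is modeled as a hypothetical dictionary `mem` mapping PTE addresses to 64-bit values; `translate` and `PageFault` are names invented for this example.

```python
PAGESIZE = 4096
PTESIZE = 8
V, R, W, X = 1 << 0, 1 << 1, 1 << 2, 1 << 3   # PTE bits 0..3


class PageFault(Exception):
    """Stands in for a page-fault exception."""


def translate(mem, root_ppn, va, levels=4):
    """Walk a `levels`-level page table and return the physical address."""
    a = root_ppn * PAGESIZE
    for i in range(levels - 1, -1, -1):
        vpn_i = (va >> (12 + 9 * i)) & 0x1FF
        pte = mem.get(a + vpn_i * PTESIZE, 0)
        if not (pte & V) or (not (pte & R) and (pte & W)):
            raise PageFault("invalid PTE")
        ppn = (pte >> 10) & ((1 << 44) - 1)
        if pte & (R | X):                      # leaf PTE
            low_mask = (1 << (9 * i)) - 1      # PPN bits that must be zero
            if ppn & low_mask:
                raise PageFault("misaligned superpage")
            vpn_low = (va >> 12) & low_mask    # low VPN fields pass through
            return ((ppn | vpn_low) << 12) | (va & 0xFFF)
        a = ppn * PAGESIZE                     # pointer to next level
    raise PageFault("no leaf PTE found")
```

Passing `levels=3` or `levels=5` gives the corresponding Sv39 or Sv57 behavior under the same simplifications.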
Sv57: Page-Based 57-bit Virtual-Memory System

This section describes a simple paged virtual-memory system designed for
RV64 systems, which supports 57-bit virtual address spaces. Sv57 is
intended for systems for which a 48-bit virtual address space is
insufficient. It closely follows the design of Sv48, simply adding an
additional level of page table, and so this chapter only details the
differences between the two schemes.
Implementations that support Sv57 must also support Sv48.

Systems that support Sv57 can also support Sv48 at essentially no cost,
and so should do so to maintain compatibility with supervisor software
that assumes Sv48.

Addressing and Memory Protection

Sv57 implementations support a 57-bit virtual address space, divided
into pages. An Sv57 address is partitioned as shown in
Figure [sv57va]. Instruction fetch addresses and
load and store effective addresses, which are 64 bits, must have bits
63–57 all equal to bit 56, or else a page-fault exception will occur.
The 45-bit VPN is translated into a 44-bit PPN via a five-level page
table, while the 12-bit page offset is untranslated.


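Figure [sv57va]: Sv57 virtual address.

| Field | VPN[4] | VPN[3] | VPN[2] | VPN[1] | VPN[0] | page offset |
|:- |:- |:- |:- |:- |:- |:- |
| Bits | 56–48 | 47–39 | 38–30 | 29–21 | 20–12 | 11–0 |
| Width | 9 | 9 | 9 | 9 | 9 | 12 |

Sv57 physical address.

| Field | PPN[4] | PPN[3] | PPN[2] | PPN[1] | PPN[0] | page offset |
|:- |:- |:- |:- |:- |:- |:- |
| Bits | 55–48 | 47–39 | 38–30 | 29–21 | 20–12 | 11–0 |
| Width | 8 | 9 | 9 | 9 | 9 | 12 |

Figure [sv57pte]: Sv57 page table entry. The 44-bit PPN field (bits 53–10) is subdivided into PPN[4] through PPN[0].

| Field | N | PBMT | Reserved | PPN[4] | PPN[3] | PPN[2] | PPN[1] | PPN[0] | RSW | D | A | G | U | X | W | R | V |
|:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |:- |
| Bits | 63 | 62–61 | 60–54 | 53–46 | 45–37 | 36–28 | 27–19 | 18–10 | 9–8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
| Width | 1 | 2 | 7 | 8 | 9 | 9 | 9 | 9 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |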


The PTE format for Sv57 is shown in
Figure [sv57pte]. Bits 63–54 and 9–0 have the same
meaning as for Sv39. Any level of PTE may be a leaf PTE, so in addition
to pages, Sv57 supports megapages, gigapages, terapages, and
petapages, each of which must be virtually and physically aligned to a
boundary equal to its size. A page-fault exception is raised if the
physical address is insufficiently aligned.
The algorithm for virtual-to-physical address translation is the same as
in Section 1.3.2, except LEVELS equals 5 and
PTESIZE equals 8.
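
The leaf sizes named above follow directly from the 12-bit page offset and the 9 bits contributed by each level. The sketch below computes them; the helper name `leaf_sizes` is an assumption made for illustration.

```python
LEAF_NAMES = ["page", "megapage", "gigapage", "terapage", "petapage"]


def leaf_sizes(levels: int) -> dict:
    """Map each leaf kind to its size in bytes for a `levels`-level scheme.

    A leaf at level i spans 2**(12 + 9*i) bytes: 4 KiB base pages plus
    9 address bits for every page-table level below the leaf.
    """
    return {LEAF_NAMES[i]: 1 << (12 + 9 * i) for i in range(levels)}
```

Under these assumptions, `leaf_sizes(5)` yields 4 KiB pages, 2 MiB megapages, 1 GiB gigapages, 512 GiB terapages, and 256 TiB petapages; `leaf_sizes(3)` and `leaf_sizes(4)` give the Sv39 and Sv48 subsets.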
“Svnapot” Standard Extension for NAPOT Translation Contiguity, Version 1.0

In Sv39, Sv48, and Sv57, when a PTE has N=1, the PTE represents a
translation that is part of a range of contiguous virtual-to-physical
translations with the same values for PTE bits 5–0. Such ranges must be
of a naturally aligned power-of-2 (NAPOT) granularity larger than the
base page size.
The Svnapot extension depends on Sv39.


| i | pte.ppn[i] | Description | pte.napot_bits |
|:- |:- |:- |:- |
| 0 | x xxxx xxx1 | Reserved | − |
| 0 | x xxxx xx1x | Reserved | − |
| 0 | x xxxx x1xx | Reserved | − |
| 0 | x xxxx 1000 | 64 KiB contiguous region | 4 |
| 0 | x xxxx 0xxx | Reserved | − |
| ≥ 1 | x xxxx xxxx | Reserved | − |
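
The encoding table for PTEs with N=1 can be expressed as a small decoder. This is a sketch only: the function name `napot_bits` and the use of `None` for reserved encodings are assumptions, and the sketch covers just the encoding lookup, not the rest of the translation behavior.

```python
def napot_bits(ppn_i: int, level: int):
    """Return pte.napot_bits for a PTE with N=1, or None if reserved.

    ppn_i is the 9-bit pte.ppn[i] field of the PTE found at `level`.
    Only level 0 with low bits 1000 is defined (64 KiB contiguous region).
    """
    if level == 0 and (ppn_i & 0xF) == 0b1000:
        return 4
    return None


def napot_region_size(bits: int) -> int:
    """Size in bytes of the contiguous region covering 2**bits base pages."""
    return 4096 << bits
```

With `napot_bits` equal to 4, the region spans 16 base pages, which matches the 64 KiB entry in the table.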


NAPOT PTEs behave identically to non-NAPOT PTEs within the
address-translation algorithm in
Section 1.3.2, except that:

If the encoding in pte is valid according to
Table [ptenapot], then instead of returning
the ori
Base	Version	Draft Frozen?
RV32I	2.0	Y
RV32E	1.9	N
RV64I	2.0	Y
RV128I	1.7	N
Extension	Version	Frozen?
M	2.0	Y
A	2.0	Y
F	2.0	Y
D	2.0	Y
Q	2.0	Y
L	0.0	N
C	2.0	Y
B	0.0	N
J	0.0	N
T	0.0	N
P	0.1	N
V	0.7	N
N	1.1	N
			`xxxxxxxxxxxxxxaa`	16-bit (`aa` ≠ `11`)

		`xxxxxxxxxxxxxxxx`	`xxxxxxxxxxxbbb11`	32-bit (`bbb` ≠ `111`)

	⋅ ⋅ ⋅`xxxx`	`xxxxxxxxxxxxxxxx`	`xxxxxxxxxx011111`	48-bit

	⋅ ⋅ ⋅`xxxx`	`xxxxxxxxxxxxxxxx`	`xxxxxxxxx0111111`	64-bit

	⋅ ⋅ ⋅`xxxx`	`xxxxxxxxxxxxxxxx`	`xnnnxxxxx1111111`	(80+16*`nnn`)-bit, `nnn`≠`111`

	⋅ ⋅ ⋅`xxxx`	`xxxxxxxxxxxxxxxx`	`x111xxxxx1111111`	Reserved for ≥192-bits

Byte Address:	base+4	base+2	base
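
The length encoding above can be decoded from the lowest-addressed 16-bit parcel alone. The sketch below is one way to do it; the function name `instruction_length_bits` and the choice of `None` for the reserved encoding are assumptions for this example.

```python
def instruction_length_bits(parcel: int):
    """Return the instruction length in bits given its first 16-bit parcel.

    Returns None for the encoding reserved for instructions of 192 bits
    or more.
    """
    parcel &= 0xFFFF
    if (parcel & 0b11) != 0b11:
        return 16                       # aa != 11
    if (parcel & 0b11100) != 0b11100:
        return 32                       # bbb != 111
    if (parcel & 0b111111) == 0b011111:
        return 48
    if (parcel & 0b1111111) == 0b0111111:
        return 64
    nnn = (parcel >> 12) & 0b111        # bits 14..12
    if nnn != 0b111:
        return 80 + 16 * nnn
    return None
```

For instance, a parcel of `0x0013` (the low half of `addi x0, x0, 0`) decodes as a 32-bit instruction, and any parcel whose two low bits are not `11` decodes as 16 bits.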
	Contained	Requested	Invisible	Fatal
Execution terminates	No	No¹	No	Yes
Software is oblivious	No	No	Yes	Yes²
Handled by environment	No	Yes	Yes	Yes

funct7	rs2	rs1	funct3	rd	opcode	R-type

imm[11:0]		rs1	funct3	rd	opcode	I-type

imm[11:5]	rs2	rs1	funct3	imm[4:0]	opcode	S-type

imm[31:12]				rd	opcode	U-type

— inst[31] —				inst[30:25]	inst[24:21]	inst[20]	I-immediate

— inst[31] —				inst[30:25]	inst[11:8]	inst[7]	S-immediate

— inst[31] —			inst[7]	inst[30:25]	inst[11:8]	0	B-immediate

inst[31]	inst[30:20]	inst[19:12]	— 0 —				U-immediate

— inst[31] —		inst[19:12]	inst[20]	inst[30:25]	inst[24:21]	0	J-immediate
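
The immediate layouts above translate directly into bit manipulation. The following sketch extracts each immediate from a 32-bit instruction word; the helper names (`_bits`, `_sext`, `imm_i`, and so on) are chosen for this example only.

```python
def _bits(inst: int, hi: int, lo: int) -> int:
    """Extract inst[hi:lo] as an unsigned integer."""
    return (inst >> lo) & ((1 << (hi - lo + 1)) - 1)


def _sext(value: int, width: int) -> int:
    """Sign-extend a `width`-bit value to a Python int."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)


def imm_i(inst):
    return _sext(_bits(inst, 31, 20), 12)


def imm_s(inst):
    return _sext((_bits(inst, 31, 25) << 5) | _bits(inst, 11, 7), 12)


def imm_b(inst):
    raw = ((_bits(inst, 31, 31) << 12) | (_bits(inst, 7, 7) << 11)
           | (_bits(inst, 30, 25) << 5) | (_bits(inst, 11, 8) << 1))
    return _sext(raw, 13)               # bit 0 is always zero


def imm_u(inst):
    return _sext(_bits(inst, 31, 12) << 12, 32)


def imm_j(inst):
    raw = ((_bits(inst, 31, 31) << 20) | (_bits(inst, 19, 12) << 12)
           | (_bits(inst, 20, 20) << 11) | (_bits(inst, 30, 21) << 1))
    return _sext(raw, 21)               # bit 0 is always zero
```

In every format the sign bit comes from inst[31], which is why each helper sign-extends from the topmost assembled bit.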
rd is `x1`/`x5`	rs1 is `x1`/`x5`	rd=rs1	RAS action
No	No	–	None
No	Yes	–	Pop
Yes	No	–	Push
Yes	Yes	No	Pop, then push
Yes	Yes	Yes	Push
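
The return-address-stack (RAS) hint table above can be written as a small function. This sketch assumes the table applies to register-indirect jumps with destination `rd` and source `rs1` given as register numbers; the function name `ras_action` and the returned strings are illustrative.

```python
def _is_link(reg: int) -> bool:
    """x1 and x5 are the registers treated as link registers."""
    return reg in (1, 5)


def ras_action(rd: int, rs1: int) -> str:
    """Return the RAS action implied by the rd/rs1 register numbers."""
    rd_link, rs1_link = _is_link(rd), _is_link(rs1)
    if not rd_link and not rs1_link:
        return "none"
    if not rd_link:
        return "pop"
    if not rs1_link:
        return "push"
    # Both are link registers: the action depends on whether they match.
    return "push" if rd == rs1 else "pop, then push"
```

For example, `ras_action(0, 1)` gives `"pop"` and `ras_action(1, 5)` gives `"pop, then push"`.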
fm field	Mnemonic	Meaning
0000	none	Normal Fence
1000	TSO	With FENCE RW,RW: exclude write-to-read ordering
		Otherwise: Reserved for future use.
other		Reserved for future use.
Scenario	Recommended NTL variant
Access to a working set between 64 KiB and 256 KiB in size	NTL.P1
Access to a working set between 256 KiB and 1 MiB in size	NTL.PALL
Access to a working set greater than 1 MiB in size	NTL.S1
Access with no exploitable temporal locality (e.g., streaming)	NTL.ALL
Access to a contended synchronization variable	NTL.PALL
Condition	Dividend	Divisor	DIVU[W]	REMU[W]	DIV[W]	REM[W]
Division by zero	x	0	2^L − 1	x	−1	x
Overflow (signed only)	−2^(L−1)	−1	–	–	−2^(L−1)	0
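
The special cases in this table can be checked with a short model of the divide and remainder operations on L-bit operands. The function names and the convention of returning signed Python integers from `div`/`rem` and unsigned ones from `divu`/`remu` are assumptions made for this sketch.

```python
def _signed(x: int, L: int) -> int:
    """Interpret the low L bits of x as a two's-complement integer."""
    x &= (1 << L) - 1
    return x - (1 << L) if x >> (L - 1) else x


def divu(a: int, b: int, L: int = 32) -> int:
    mask = (1 << L) - 1
    a, b = a & mask, b & mask
    return mask if b == 0 else a // b        # division by zero: 2^L - 1


def remu(a: int, b: int, L: int = 32) -> int:
    mask = (1 << L) - 1
    a, b = a & mask, b & mask
    return a if b == 0 else a % b            # division by zero: dividend


def div(a: int, b: int, L: int = 32) -> int:
    a, b = _signed(a, L), _signed(b, L)
    if b == 0:
        return -1
    if a == -(1 << (L - 1)) and b == -1:
        return a                             # overflow: -2^(L-1)
    q = abs(a) // abs(b)                     # round toward zero
    return -q if (a < 0) != (b < 0) else q


def rem(a: int, b: int, L: int = 32) -> int:
    a, b = _signed(a, L), _signed(b, L)
    if b == 0:
        return a
    if a == -(1 << (L - 1)) and b == -1:
        return 0
    r = abs(a) % abs(b)
    return -r if a < 0 else r                # sign follows the dividend
```

Python's `//` rounds toward negative infinity, so the signed helpers work on absolute values and restore the sign afterwards to obtain round-toward-zero behavior.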