This document will be based on the ARMv8-A architecture, mainly focusing on AArch64, and comparing it to ATmega2560's megaAVR architecture.
Thus "ARM" will be used to mean "ARMv8-A AArch64" and "AVR" to mean "megaAVR".
Microarchitectural details for ARM are taken from the Cortex-A57.
Focusing on the differences between ARM and AVR, ARM has:
- Deeply out-of-order, superscalar, pipelined execution
- Up to 128 instructions can be in some state of execution simultaneously
- Dynamic branch prediction
- Multicore support
- 32-bit fixed length RISC instruction set
- Support for register indexed and PC-relative addressing
- Conditional select and compare
- FP, DSP, SIMD and cryptography instructions support
- Hardware virtualisation support
- Multiple CPU execution modes for privilege separation
- 31 64-bit general purpose registers
- Up to 48 bits of addressable virtual and physical memory
- Top 8 bits of the 64-bit address can be configured for pointer tagging
- L1 and L2 CPU caches
AVR
- Modified Harvard architecture
- Modifications are special instructions for reading and writing to program memory
- Program memory is word (16-bit) addressed
- Data memory is byte (8-bit) addressed
- Little endian
- 17 bits of addressable program memory for a maximum of 256 KiB
- 16 bits of addressable data memory for a maximum of 64 KiB
- First 512 bytes of data memory are always mapped to registers:
- 00 -> 1f: General purpose registers
- 20 -> 5f: Standard IO registers
- 60 -> 1ff: Extended IO registers
- No MMU
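The AVR data address map above can be sketched as a small lookup. This is only an illustration: the function names are made up, the label for addresses past `0x1FF` is an assumption (the list above does not name that region), and the `in`/`out` port number is taken to be the data address minus `0x20`.

```python
# Regions of the ATmega2560 data address space, as listed above.
# The label for addresses past 0x1FF is an assumption of this sketch.
REGIONS = [
    (0x0000, 0x001F, "general purpose registers"),
    (0x0020, 0x005F, "standard IO registers"),
    (0x0060, 0x01FF, "extended IO registers"),
    (0x0200, 0xFFFF, "remaining data memory"),
]


def classify_data_address(addr):
    """Return the name of the region a 16-bit data address falls into."""
    for first, last, name in REGIONS:
        if first <= addr <= last:
            return name
    raise ValueError("data addresses are 16 bits wide")


def io_port_number(addr):
    """Port number that `in`/`out` would use for a standard IO register."""
    if classify_data_address(addr) != "standard IO registers":
        raise ValueError("only 0x20-0x5F are reachable with in/out")
    return addr - 0x20
```

For instance, the status register sits at data address `0x5F`, which this sketch maps to IO port `0x3F`.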
ARM
- Combination of Von Neumann and Harvard architecture
- Shared memory (L3) and L2 cache for both program and data memory (Von Neumann)
- L1 cache is split into instruction (program) and data caches (Harvard)
- L3 memory is used for all physical addressing (Von Neumann)
- Memory is byte (8-bit) addressed
- Instructions must be aligned to 32 bits, or a Misaligned PC fault is raised
- Unaligned device memory load/store raises an Alignment fault
- Unaligned normal memory load/store, however, is usually possible but slow, and only if the processor is configured to allow it
- Little endian instructions, but memory access can be configured to be either big or little endian
- 48 bits of addressable physical memory for a maximum of 256 TiB
- Physical memory is split into:
- Normal memory
- Device memory
- Defined as memory where multiple reads can return different values and writes can cause side effects
- IO happens here (i.e., memory mapped)
- Registers are not mapped to memory except...
- GIC (Generic Interrupt Controller) registers and debug registers are mapped here
- Has an MMU
- 64-bit addresses
- Top 8 bits can be configured for pointer tagging, and will be ignored by the MMU if so
- 49 bits of addressable virtual memory for a maximum of 512 TiB
- Split into two 48-bit subranges positioned at the bottom and top of the 64-bit address space
- Top subrange typically used for kernel while bottom for applications
- Configurable page granularity of 4, 16 or 64 KiB
- Read/Write/Execute permissions are changeable
- Further split between "Non-secure state" and "Secure state", where Secure state can only be entered through `EL3`
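The virtual address layout described above (two 48-bit subranges at the bottom and top of the 64-bit space, with optional top-byte tagging) can be sketched as follows. This is a hedged model, not the MMU's actual algorithm: the function name and return strings are invented, the full 48-bit range size is assumed to be configured, and with tagging enabled bit 55 is assumed to decide which subrange an address belongs to.

```python
MASK64 = (1 << 64) - 1
VA_BITS = 48  # assumption: the maximum range size is configured


def classify_virtual_address(addr, tagging=False):
    """Say which subrange a 64-bit virtual address falls into."""
    addr &= MASK64
    if tagging:
        # The tag in bits [63:56] is ignored: replace it with copies of bit 55.
        top_byte = 0xFF if (addr >> 55) & 1 else 0x00
        addr = (addr & ((1 << 56) - 1)) | (top_byte << 56)
    upper_bits = addr >> VA_BITS  # the 16 bits above the 48-bit range
    if upper_bits == 0x0000:
        return "bottom subrange (typically applications)"
    if upper_bits == 0xFFFF:
        return "top subrange (typically kernel)"
    return "outside both subranges"
```

A pointer such as `0xAB00000012345678` only lands in the bottom subrange when tagging is enabled; without it, the tag byte pushes the address outside both subranges.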
AVR
- 32 8-bit wide general purpose registers
- 1 17-bit wide program counter (wide enough to address all 128 Ki words of program memory)
- 480 8-bit wide IO registers (some of which pair up to form a single 16-bit register, e.g., the stack pointer)
- Status register is here
- Registers that modify the behaviour of the CPU
- IO to peripherals (USART, SPI, GPIO, etc.)
- Other things that cause side effects
ARM
- 31 64-bit wide general purpose registers
- 1 64-bit wide zero register
- 1 64-bit wide program counter
- 22 64-bit or 32-bit wide special purpose registers
- Registers that hold and save state for the different CPU execution modes
- Process state register is here (`PSTATE`). Some notable "subregisters" include:
  - Condition Flags (`NZCV`)
  - Interrupt Mask Bits (`DAIF`)
  - Current Exception Level (`CurrentEL`)
- Includes 4 64-bit wide stack pointers (one for each CPU execution mode)
- Numerous 64-bit and 32-bit wide system registers
- Registers that provide control and status information of CPU features
- `N`-bit: Negative condition flag
  - Set to the MSB of the last flag-setting instruction's result
  - If the result is interpreted as a two's complement signed integer, the result was negative if set
- `Z`-bit: Zero condition flag
  - Set if the last flag-setting instruction's result was zero
- `C`-bit: Carry condition flag
  - Set if the last flag-setting instruction resulted in a carry condition (e.g., unsigned overflow)
- `V`-bit: Overflow condition flag
  - Set if the last flag-setting instruction resulted in an overflow condition (e.g., signed overflow)
- Typically the integer and floating-point arithmetic and compare/test instructions set/clear these flags
- These categories of instructions use the flags as input:
  - Conditional branch (i.e., `b.cond`)
  - Add or subtract with carry (i.e., `adc`, `adcs`, `sbc`, `sbcs`)
  - Conditional select (i.e., `csinc`, `csneg`, `csinv`, `csel`, `cset`, `csetm`)
  - Conditional compare (i.e., `ccmp`, `ccmn`)
- The following condition codes are available, shown with their meaning when applied to integers:
  - `eq`/`ne`: Equal/Not equal (`Z == 1/0`)
  - `cs`/`cc`: Carry set/Carry clear (`C == 1/0`)
  - `mi`/`pl`: Minus/Plus or zero (`N == 1/0`)
  - `vs`/`vc`: Overflow/No overflow (`V == 1/0`)
  - `hi`: Unsigned higher (`C == 1 && Z == 0`)
  - `ls`: Unsigned lower or same (`!(C == 1 && Z == 0)`)
  - `ge`: Signed greater than or equal (`N == V`)
  - `lt`: Signed less than (`N != V`)
  - `gt`: Signed greater than (`Z == 0 && N == V`)
  - `le`: Signed less than or equal (`!(Z == 0 && N == V)`)
  - `al`: Always
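To make the flags and condition codes concrete, here is a minimal Python model of a 64-bit flag-setting add and subtract, plus a condition check. The function names are invented for this sketch, and the subtraction is modelled as `a + ~b + 1`, so `C == 1` after a compare means "no borrow".

```python
MASK64 = (1 << 64) - 1


def adds64(a, b, carry_in=0):
    """Model a 64-bit flag-setting add; returns (result, (N, Z, C, V))."""
    a &= MASK64
    b &= MASK64
    full = a + b + carry_in
    result = full & MASK64
    n = result >> 63                      # MSB of the result
    z = int(result == 0)                  # result was zero
    c = int(full > MASK64)                # carry out of bit 63
    # Signed overflow: both operands share a sign the result does not have.
    v = (((a ^ result) & (b ^ result)) >> 63) & 1
    return result, (n, z, c, v)


def subs64(a, b):
    """Model a 64-bit flag-setting subtract (also what a compare does)."""
    return adds64(a, ~b & MASK64, 1)


def condition_holds(cond, flags):
    """Evaluate one of the condition codes against an (N, Z, C, V) tuple."""
    n, z, c, v = flags
    table = {
        "eq": z == 1, "ne": z == 0,
        "cs": c == 1, "cc": c == 0,
        "mi": n == 1, "pl": n == 0,
        "vs": v == 1, "vc": v == 0,
        "hi": c == 1 and z == 0, "ls": not (c == 1 and z == 0),
        "ge": n == v, "lt": n != v,
        "gt": z == 0 and n == v, "le": not (z == 0 and n == v),
        "al": True,
    }
    return table[cond]
```

Comparing 3 with 5 through `subs64(3, 5)` leaves `N` set and `C` clear, so both `lt` (signed) and `cc` (unsigned lower) hold.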
- Called "Exception Levels"
- 4 in total, from `EL0` to `EL3`
- Stored in the process state register (`PSTATE`) and can be accessed via the `CurrentEL` "subregister"
- The levels are intended to be used for:
  - `EL3`: Secure monitor
    - Highest execution mode
    - Has access to the Secure state and programs running in it
    - Basically no restrictions
  - `EL2`: Hypervisor
    - Only has access to Non-secure state
    - Responsible for switching between virtual machines, each comprising non-secure `EL1` and `EL0`
    - Has controls to trap various operations from the lower execution modes
    - Can set up a second stage MMU to map `EL1` memory (IPA: Intermediate Physical Address) to real physical memory
  - `EL1`: OS kernel and associated functions that are typically described as privileged
    - Can be in either the Secure state or Non-secure state
    - Has access to everything, except the things (often transparently) restricted by the higher execution modes
    - e.g., set up MMU, IRQ handlers, control IO, etc
  - `EL0` (unprivileged execution): Applications
    - Has the least privileges of all the levels
    - Cannot modify anything other than its own or shared writeable memory, and general purpose registers
    - All IO and other privileged operations must be performed through a system call to `EL1`
- Each execution mode has its own dedicated stack pointer, process state (`PSTATE`) and expected return address registers
  - Automatically switches to the dedicated registers on an execution mode change
  - Each mode can still access the less privileged modes' registers, however
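As a small illustration of the points above, the sketch below decodes a value read from `CurrentEL` and applies the "higher levels can reach lower levels' registers" rule. The helper names and the role table are made up for this example, and it assumes the level is held in bits [3:2] of the value read.

```python
# Intended roles of the four exception levels, as described above.
ROLES = {
    0: "Applications",
    1: "OS kernel",
    2: "Hypervisor",
    3: "Secure monitor",
}


def exception_level(current_el_value):
    """Extract the level from a CurrentEL value (assumed to be bits [3:2])."""
    return (current_el_value >> 2) & 0b11


def can_access_registers_of(current_level, other_level):
    """A level can use its own dedicated registers and those of lower levels."""
    return other_level <= current_level
```

For example, a `CurrentEL` value of `0b0100` decodes to level 1, which may touch the `EL0` registers but not those of `EL2`.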
AVR
- Only way to interrupt execution
- Only a single priority of IRQs
- Each interrupt can be masked by setting/clearing certain bits in certain IO registers
- Interrupts can be enabled/disabled globally by setting/clearing the `I` flag in the status register
- Interrupt handlers run under the same context as the currently executing code, except a return address is pushed onto the stack
- Interrupts are disabled globally upon entering the handler, and re-enabled upon exiting via `reti`
- No software triggered interrupts
- Except writing to an output pin where an interrupt is configured on the pin
ARM
- One of several ways to interrupt execution, all of which are known as exceptions in the ARM world
- Two kinds of interrupt request: IRQ and FIQ
- FIQs are intended to be higher priority than IRQs, but the architecture explicitly leaves this implementation-defined
- Each interrupt can be masked by setting/clearing certain bits in certain IO registers
- IRQs and FIQs can be enabled/disabled globally by clearing/setting the `I` and `F` flags in `DAIF` of `PSTATE` respectively
  - However, when the target exception level is higher than the current level:
    - If the target level is `EL2` or `EL3`, the interrupts cannot be masked
    - If the target level is `EL1`, the interrupts can be masked
      - but `DAIF` cannot be modified by default under `EL0`
- Interrupts of a lower target exception level remain pending if current execution is in a higher level
- The following steps are taken when entering an interrupt:
  1. `PSTATE` is saved to the target exception level's `SPSR_ELx` special purpose register
  2. The preferred return address is saved to the `ELR_ELx` special purpose register
  3. All bits in `DAIF` are set
  4. Execution moves to the target exception level
  5. The exception level's dedicated stack pointer register is selected for use
- The following steps are taken when returning from an interrupt with `eret`:
  1. `PC` is restored with the contents in `ELR_ELx`
  2. `PSTATE` is restored with the contents in `SPSR_ELx`
     - Since steps 3-5 of entering the interrupt are just modifying `PSTATE`, this step automatically reverts them
- No software triggered interrupts
- Software triggered exceptions, however, do exist. For example:
- System calls via the `svc` (`EL0` → `EL1`), `hvc` (`EL1` → `EL2`) and `smc` (`EL2` → `EL3`) instructions
- Trapped/privileged instructions (e.g., by the hypervisor)
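The interrupt entry and `eret` steps listed above can be modelled in a few lines. This is a deliberately simplified sketch: the `Cpu` class, its field names and the `handler_address` parameter are all hypothetical, and `PSTATE` is reduced to the current level, `DAIF` and `NZCV`.

```python
from dataclasses import dataclass, field


@dataclass
class Cpu:
    """Tiny illustrative model; the field names are made up for this sketch."""
    pc: int = 0
    current_el: int = 0
    daif: int = 0b0000          # interrupt mask bits D, A, I, F
    nzcv: int = 0b0000          # condition flags
    spsr: dict = field(default_factory=dict)  # saved PSTATE, keyed by level
    elr: dict = field(default_factory=dict)   # return address, keyed by level

    def pstate(self):
        return (self.current_el, self.daif, self.nzcv)


def take_interrupt(cpu, target_el, handler_address):
    cpu.spsr[target_el] = cpu.pstate()   # save PSTATE to SPSR_ELx
    cpu.elr[target_el] = cpu.pc          # save the preferred return address
    cpu.daif = 0b1111                    # set all bits in DAIF
    cpu.current_el = target_el           # move to the target level (and its SP)
    cpu.pc = handler_address


def eret(cpu):
    level = cpu.current_el
    cpu.pc = cpu.elr[level]                                # restore PC
    cpu.current_el, cpu.daif, cpu.nzcv = cpu.spsr[level]   # restore PSTATE
```

Because the level and mask bits live inside the saved `PSTATE`, restoring it in `eret` is enough to undo the changes made on entry.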
AVR
- Mostly 16-bit instructions, with a handful of 32-bit ones (e.g., `lds`, `sts`, `jmp`, `call`)
- Only accepts one or two operands each
- Typically destination register is also one of the sources
- No register-indexed addressing
- No PC-relative addressing for load/store
- Only relative jumps and calls
- No conditional execution except for branching
- Stack pointer is a dedicated IO register
- Can only be read and written with `in` and `out`
- Only `push` and `pop` instructions can access memory via the stack pointer
- Must use a frame pointer for local stack variables
- No barrel shifter, so can only shift left or right by a single bit per instruction
- Has `in` and `out` instructions for IO
- Has `sleep` for entering power saving modes
- Has a built-in watchdog, which is reset with `wdr`
ARM
- 32-bit fixed length instructions
- Encoded into certain categories:
  - `xxx00...`: Unallocated
  - `xxx100...`: Data processing - immediate
  - `xxx101...`: Branch, exception generation and system instructions
  - `xxxx1x0...`: Loads and stores
  - `xxxx101...`: Data processing - register
  - `xxxx111...`: Data processing - SIMD and floating point
- Usually accepts three or four operands each
- Destination register can be different to source registers
- Register-indexed addressing available for load/store
- Most frequently used load and store instructions accept an optional register offset
- PC-relative addressing available for load and branch instructions
- Literal loads and most branch instructions accept a PC-relative offset
- `adr` calculates the absolute address from a PC-relative one and puts it in a register when needed for other purposes (e.g., as the base address of a store)
- Conditional execution of certain instructions
- See section Registers for what instructions and conditions are available
- The condition is accepted as a special operand
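- If the condition is false, the instruction produces an alternative result instead of being skipped (e.g., `csel` picks its second source register, `ccmp` loads an immediate value into the flags), avoiding branching overheads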
- Stack pointer (`sp`) is a special purpose register, restricted to a few instructions:
  - No dedicated push/pop instructions
  - Load/Store instructions can use it as the base address
    - Push/Pop can be done with these instructions with post/pre-indexing
  - Add/Subtract instructions can use it as a source operand, destination, or both
    - Copying it to another register is done by adding `sp` with `#0` (which is aliased by a special `mov` instruction)
  - Logical instructions in immediate form accept it as a destination operand
- Possible to not use a frame pointer at all with this design
- Has a barrel shifter, so shifts by any number of bits can be done in a single instruction, usually at little or no extra cost
- All IO is done exclusively through memory mapped IO (Device memory. See Memory Model)
- Thus, IO can only be done by the load and store instructions, as if it were normal memory
- Has `wfe` and `wfi` for entering power saving modes
- Does not require a built-in watchdog
- If a watchdog is added, however, resetting would be through an implementation-defined IO write
- FP, DSP, SIMD and cryptography instructions are available
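The top-level encoding categories listed above can be checked with a short classifier. This is a sketch under the assumption that the bit patterns in that list describe bits [28:25] of the 32-bit instruction word (after the three leading "don't care" bits); the function name is invented.

```python
def instruction_class(word):
    """Classify a 32-bit instruction word by bits [28:25]."""
    op0 = (word >> 25) & 0b1111
    if op0 & 0b1100 == 0b0000:          # 00xx
        return "Unallocated"
    if op0 & 0b1110 == 0b1000:          # 100x
        return "Data processing - immediate"
    if op0 & 0b1110 == 0b1010:          # 101x
        return "Branch, exception generation and system instructions"
    if op0 & 0b0101 == 0b0100:          # x1x0
        return "Loads and stores"
    if op0 & 0b0111 == 0b0101:          # x101
        return "Data processing - register"
    return "Data processing - SIMD and floating point"  # x111
```

Feeding it a few well-known encodings, such as `0xD503201F` (`nop`) or `0xF9400020` (`ldr x0, [x1]`), gives the system and load/store categories respectively.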
Note: Considering only the fundamental C types
AVR
- Word size is 8 bits
- Minor exception for register pairs `r31:r30`, `r29:r28`, `r27:r26` and `r25:r24` where 16-bit addition and subtraction by an immediate value from 0 to 63 can be done in a single instruction
- Two's complement is used for signed integers
- No FPU, so floating point types have to be emulated
- Implements only `[u]int8_t` natively. Other types are emulated
ARM
- Word size is 64 bits
- Two's complement is used for signed integers
- SIMD FPU is available, with half, single and double precision floating-point numbers supported
- All fundamental C types can be implemented natively, and more
- 128-bit vectors and vector operations are available both in integer and floating-point mode
- All fundamental C types can be inserted into the vectors
Since 64-bit integers are supported natively in AArch64, I will instead implement this in AArch32, where 64-bit integers aren't supported natively, in the spirit of the question.
Note: I would have done 128-bit integers if AArch64 had a 64-bit version of `umull`, but it doesn't, and writing one is not fun.
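Addition can easily be done with the `adc` instruction, as long as the low-word addition sets the carry flag (`adds`).
For example (assuming the numbers are in `r1:r0` and `r3:r2`, and the output goes in `r5:r4`):
adds r4, r0, r2 // low = low words added, carry out goes into the C flag
adc r5, r1, r3 // high = high words added, plus the carry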
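Subtraction can use the same code as above, but with the number to subtract negated first (two's complement).
Alternatively, the following can be used:
subs r4, r0, r2 // low = low words subtracted, C flag cleared on borrow
sbc r5, r1, r3 // high = high words subtracted, minus the borrow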
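The carry chaining used above can be checked with a small Python model that works on 32-bit halves. The function names are invented for this sketch; the carry after the low-word subtraction follows the ARM convention (`C == 1` means no borrow), so the subtraction is modelled as adding the inverted operand plus the carry.

```python
MASK32 = 0xFFFFFFFF


def add64(a_hi, a_lo, b_hi, b_lo):
    """64-bit add built from two 32-bit adds; returns (hi, lo)."""
    low_sum = a_lo + b_lo
    carry = low_sum >> 32                  # the C flag after the low-word add
    lo = low_sum & MASK32
    hi = (a_hi + b_hi + carry) & MASK32    # what adc computes
    return hi, lo


def sub64(a_hi, a_lo, b_hi, b_lo):
    """64-bit subtract built from two 32-bit subtracts; returns (hi, lo)."""
    low_diff = a_lo + (~b_lo & MASK32) + 1          # a_lo - b_lo
    carry = low_diff >> 32                          # C flag: 1 means no borrow
    lo = low_diff & MASK32
    hi = (a_hi + (~b_hi & MASK32) + carry) & MASK32  # what sbc computes
    return hi, lo
```

For example, `add64(1, 0xFFFFFFFF, 0, 1)` carries out of the low word and yields `(2, 0)`.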
If the inputs are `a:b = (a * 2^32 + b)` and `c:d = (c * 2^32 + d)`, the low 64 bits of the product will be `(b * d) + (a * d + b * c) * 2^32`.
That means `b * d` needs to return a 64-bit result, which the `umull` instruction does, and have `a * d + b * c` added to its upper 32 bits.
For example (assuming the numbers are in `r1:r0` and `r3:r2`, and the output goes in `r5:r4`):
umull r4, r5, r0, r2 // high:low = b * d
mla r5, r1, r2, r5 // high = a * d + high
mla r5, r0, r3, r5 // high = b * c + high
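The decomposition above can be sanity-checked against Python's arbitrary-precision integers. This is only a model of the three-instruction sequence, with an invented function name; each step is truncated to 32 bits the way the destination register would be.

```python
MASK32 = 0xFFFFFFFF


def mul64_low(a, b, c, d):
    """Low 64 bits of (a:b) * (c:d), returned as (hi, lo) 32-bit halves."""
    full = b * d                   # 32x32 -> 64-bit product (umull)
    lo = full & MASK32
    hi = full >> 32
    hi = (a * d + hi) & MASK32     # first mla: high = a * d + high
    hi = (b * c + hi) & MASK32     # second mla: high = b * c + high
    return hi, lo
```

The `a * c` term never appears because it would only affect bits 64 and above, which are discarded.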
Before starting this assignment, I knew very little about ARM, and having to compare it to AVR made me assume the two would be fairly similar.
However, after finishing this assignment, I have discovered that ARM is a lot more similar to x86-64 than to AVR, which is why it is commonly said to be a competitor to Intel.
For example, ARM would be complete overkill for simple mechatronics projects where interfacing with inputs and outputs at (relatively) low speeds without much data processing is all that is needed, which is why AVR is often used for these projects.
Where ARM shines, however, is in its current main use in smartphones, where privilege separation and multitasking are important, and where the at-most-16-MHz clock of the ATmega2560 is definitely not enough.
Other possible use cases would be DSP applications (e.g., telecommunications), either by itself or as a coprocessor to an FPGA or ASIC, to handle large amounts of data at high speeds.
The lack of a built-in watchdog for ARM (arguably the only feature AVR has that ARM doesn't) likely points to the fact that ARM was designed for a more general purpose use rather than for embedded systems.
Don't use ARM if you:
- Need very low power consumption
- Due to all the features and speed that ARM packs in, it is unlikely to consume less power than the simpler AVR
- Only do DSP, FP or cryptography operations infrequently
- The cost of switching to ARM usually greatly outweighs that of having slow but infrequent DSP/FP/crypto on AVR
and have no need for:
- Heavy or large number crunching or high performance
- AVR only supports up to 8-bit integers natively, and has no SIMD
- High processor frequencies
- The ATmega2560 only goes up to 16 MHz, while ARM can go up to a few GHz
- Privilege separation of running code
- AVR has only a single CPU execution mode
- Multitasking
Otherwise, ARM may be a good fit for your needs.
Wikipedia, 2015, ARM architecture. 2015. Available from: http://en.wikipedia.org/wiki/ARM_architecture
Grisenthwaite, Richard, 2011, ARMv8 Technology Preview. Presentation. 2011.
ARM Architecture Reference Manual; ARMv8, for ARMv8-A architecture profile, 2013. A.a. ARM Holdings.
Wikipedia, 2015, ARM Cortex-A57. 2015. Available from: http://en.wikipedia.org/wiki/ARM_Cortex-A57
ARM Holdings, 2014, Cortex-A57 Processor - ARM. 2014. Available from: http://www.arm.com/products/processors/cortex-a/cortex-a57-processor.php
ARM® Cortex®-A57 MPCore Processor Technical Reference Manual, 2014. r1p3. ARM Holdings.
Atmel ATmega640/V-1280/V-1281/V-2560/V-2561/V, 2014. 2549Q-02/2014. Atmel Corporation.