kotarou3/ARM compared with AVR.md

## ARM compared with AVR.md

      
    ARM compared with AVR

Disclaimer

This document will be based on the ARMv8-A architecture, mainly focusing on AArch64, and comparing it to ATmega2560's megaAVR architecture.

Thus "ARM" will be used to mean "ARMv8-A AArch64" and "AVR" as "megaAVR".
Microarchitectural details for ARM are taken from the Cortex-A57.
Overview

Focusing on the differences between ARM and AVR, ARM has:

Deeply out-of-order, superscalar, pipelined execution

Up to 128 instructions can be in some state of execution simultaneously


Dynamic branch prediction
Multicore support
32-bit fixed length RISC instruction set
Support for register indexed and PC-relative addressing
Conditional select and compare
FP, DSP, SIMD and cryptography instructions support
Hardware virtualisation support
Multiple CPU execution modes for privilege separation
31 64-bit general purpose registers
Up to 48 bits of addressable virtual and physical memory

Top 8 bits of the 64-bit address can be configured for pointer tagging


L1 and L2 CPU caches

Memory Model

AVR

Modified Harvard architecture

Modifications are special instructions for reading and writing to program memory


Program memory is word (16-bit) addressed
Data memory is byte (8-bit) addressed
Little endian
17 bits of addressable program memory for a maximum of 256 KiB
16 bits of addressable data memory for a maximum of 64 KiB
First 512 bytes of data memory are always mapped to registers:

00 ->  1f: General purpose registers
20 ->  5f: Standard IO registers
60 -> 1ff: Extended IO registers


No MMU
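
As a rough illustration of the register mapping above, the following C sketch classifies a data address into its region. The names `avr_region`, `avr_classify_data_address` and `avr_io_port_number` are made up for this example. The 0x20 offset between `in`/`out` port numbers and data addresses is not stated above and is taken from the ATmega2560 datasheet.

```c
#include <assert.h>
#include <stdint.h>

/* Regions of the AVR data address space, following the list above. */
typedef enum {
    AVR_REGION_GPR,         /* 0x0000 -> 0x001f: general purpose registers */
    AVR_REGION_STANDARD_IO, /* 0x0020 -> 0x005f: standard IO registers */
    AVR_REGION_EXTENDED_IO, /* 0x0060 -> 0x01ff: extended IO registers */
    AVR_REGION_MEMORY       /* 0x0200 and up: not mapped to registers */
} avr_region;

/* Hypothetical helper: which region does a 16-bit data address fall in? */
avr_region avr_classify_data_address(uint16_t addr)
{
    if (addr <= 0x001f) return AVR_REGION_GPR;
    if (addr <= 0x005f) return AVR_REGION_STANDARD_IO;
    if (addr <= 0x01ff) return AVR_REGION_EXTENDED_IO;
    return AVR_REGION_MEMORY;
}

/* Hypothetical helper: in/out number the standard IO registers from 0,
 * so the port number is the data address minus 0x20. */
uint8_t avr_io_port_number(uint16_t data_addr)
{
    assert(avr_classify_data_address(data_addr) == AVR_REGION_STANDARD_IO);
    return (uint8_t)(data_addr - 0x20);
}
```

For example, the status register sits at data address 0x5f, which `in` and `out` address as port 0x3f.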

ARM

Combination of Von Neumann and Harvard architecture

Shared memory (L3) and L2 cache for both program and data memory (Von Neumann)
L1 cache is split into instruction (program) and data caches (Harvard)
L3 memory is used for all physical addressing (Von Neumann)


Memory is byte (8-bit) addressed

Instructions must be aligned to 32 bits, or a PC alignment fault is raised
Unaligned device memory load/store raises an Alignment fault
Unaligned normal memory load/store, however, is usually possible but slow, and only if the processor is configured to allow it


Little endian instructions, but memory access can be configured to be either big or little endian
48 bits of addressable physical memory for a maximum of 256 TiB
Physical memory is split into:

Normal memory
Device memory

Defined as memory where multiple reads can return different values and writes can cause side effects
IO happens here (i.e., memory mapped)
Registers are not mapped to memory except...
GIC (Generic Interrupt Controller) registers and debug registers are mapped here


Has an MMU

64-bit addresses

Top 8 bits can be configured for pointer tagging, and will be ignored by the MMU if so


49 bits of addressable virtual memory for a maximum of 512 TiB

Split into two 48-bit subranges positioned at the bottom and top of the 64-bit address space
Top subrange typically used for kernel while bottom for applications


Configurable page granularity of 4, 16 or 64 KiB
Read/Write/Execute permissions are changeable
Further split between "Non-secure state" and "Secure state", where secure state is only accessible from EL3
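
The virtual address layout above can be sketched in C as follows. All names here (`va_range`, `strip_tag`, `classify_va`, `page_offset`) are invented for the example. It also assumes that, when tagging is enabled, bit 55 decides which subrange an address belongs to; that detail is not stated above and should be checked against the architecture reference manual.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { VA_LOWER_RANGE, VA_UPPER_RANGE, VA_INVALID } va_range;

/* Replace the 8-bit tag with copies of bit 55 (assumed selector bit). */
uint64_t strip_tag(uint64_t addr)
{
    const uint64_t tag_mask = 0xffULL << 56;
    return (addr & (1ULL << 55)) ? (addr | tag_mask) : (addr & ~tag_mask);
}

/* A valid address has its top 16 bits all clear (bottom 48-bit subrange)
 * or all set (top 48-bit subrange). Anything else is unmapped. */
va_range classify_va(uint64_t addr, bool tagging_enabled)
{
    if (tagging_enabled)
        addr = strip_tag(addr);

    uint64_t top16 = addr >> 48;
    if (top16 == 0x0000) return VA_LOWER_RANGE;
    if (top16 == 0xffff) return VA_UPPER_RANGE;
    return VA_INVALID;
}

/* Offset of an address within its page for a 4, 16 or 64 KiB granule. */
uint64_t page_offset(uint64_t addr, unsigned granule_kib)
{
    assert(granule_kib == 4 || granule_kib == 16 || granule_kib == 64);
    return addr & ((uint64_t)granule_kib * 1024 - 1);
}
```

With tagging disabled, an address such as 0xab007ffffffff000 is rejected; with tagging enabled, the 0xab tag is ignored and the address lands in the bottom subrange.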


Registers

AVR

32 8-bit wide general purpose registers
1 17-bit wide program counter
480 8-bit wide IO registers (some of which are halves of a single 16-bit register, e.g., the stack pointer)

Status register is here
Registers that modify the behaviour of the CPU
IO to peripherals (USART, SPI, GPIO, etc.)
Other things that cause side effects


ARM

31 64-bit wide general purpose registers
1 64-bit wide zero register
1 64-bit wide program counter
22 64-bit or 32-bit wide special purpose registers

Registers that hold and save state for the different CPU execution modes
Process state register is here (PSTATE). Some notable "subregisters" include:

Condition Flags (NZCV)
Interrupt Mask Bits (DAIF)
Current Exception Level (CurrentEL)


Includes 4 64-bit wide stack pointers (one for each CPU execution mode)


Numerous 64-bit and 32-bit wide system registers

Registers that provide control and status information of CPU features


How the condition flags (NZCV) are used


N-bit: Negative condition flag

Set to the MSB of the last flag-setting instruction's result
If the result is interpreted as a two's complement signed integer, the result was negative if set


Z-bit: Zero condition flag

Set if the last flag-setting instruction's result was zero


C-bit: Carry condition flag

Set if the last flag-setting instruction resulted in a carry condition (e.g., unsigned overflow)


V-bit: Overflow condition flag

Set if the last flag-setting instruction resulted in an overflow condition (e.g., signed overflow)


Typically the integer and floating-point arithmetic and compare/test instructions set/clear these flags
These categories of instructions use the flags as input:

Conditional branch (i.e., b.cond)
Add or subtract with carry (i.e., adc, adcs, sbc, sbcs)
Conditional select (i.e., csinc, csneg, csinv, csel, cset, csetm)
Conditional compare (i.e., ccmp, ccmn)
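
To make the four flags concrete, here is a small C model of how a 64-bit flag-setting add or subtract could derive them. It is only a sketch: the names `flagged_result`, `add_with_carry`, `adds` and `subs` are chosen for this example, and it assumes subtraction is performed as `a + ~b + 1`, which is why C ends up set when there is no borrow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t result;
    bool n, z, c, v;
} flagged_result;

/* Add two 64-bit values plus a carry-in of 0 or 1 and derive NZCV. */
flagged_result add_with_carry(uint64_t a, uint64_t b, unsigned carry_in)
{
    assert(carry_in <= 1);

    uint64_t partial = a + b;
    uint64_t result = partial + carry_in;

    flagged_result out;
    out.result = result;
    out.n = (result >> 63) != 0;                  /* MSB of the result */
    out.z = (result == 0);
    out.c = (partial < a) || (result < partial);  /* unsigned wraparound */
    /* Signed overflow: both inputs share a sign the result does not have. */
    out.v = (((a ^ result) & (b ^ result)) >> 63) != 0;
    return out;
}

/* Flag-setting add, and flag-setting subtract (which is also what cmp does). */
flagged_result adds(uint64_t a, uint64_t b) { return add_with_carry(a, b, 0); }
flagged_result subs(uint64_t a, uint64_t b) { return add_with_carry(a, ~b, 1); }
```

For instance, `subs(5, 3)` leaves C set and Z clear, while `subs(3, 5)` leaves C clear and N set.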


The following condition codes are available, along with their meaning when applied to integers

eq/ne: Equal/Not equal (Z == 1/0)
cs/cc: Carry set/Carry clear (C == 1/0)
mi/pl: Minus/Plus or zero (N == 1/0)
vs/vc: Overflow/No overflow (V == 1/0)
hi: Unsigned higher (C == 1 && Z == 0)
ls: Unsigned lower or same (!(C == 1 && Z == 0))
ge: Signed greater than or equal (N == V)
lt: Signed less than (N != V)
gt: Signed greater than (Z == 0 && N == V)
le: Signed less than or equal (!(Z == 0 && N == V))
al: Always
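
The condition codes can be written out as a C predicate over the flags. This is a sketch rather than anything taken from the manual: `cond_code` and `condition_holds` are names invented here, and the flags are packed into a 4-bit value in N, Z, C, V order (N = 8, Z = 4, C = 2, V = 1). Note that eq holds when Z is set and ne when Z is clear, and likewise for the other set/clear pairs.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    COND_EQ, COND_NE, COND_CS, COND_CC, COND_MI, COND_PL, COND_VS, COND_VC,
    COND_HI, COND_LS, COND_GE, COND_LT, COND_GT, COND_LE, COND_AL
} cond_code;

/* Does the condition hold for the given flags (4-bit value, NZCV order)? */
bool condition_holds(cond_code cond, unsigned nzcv)
{
    assert(nzcv <= 0xf);

    bool n = (nzcv & 8) != 0;
    bool z = (nzcv & 4) != 0;
    bool c = (nzcv & 2) != 0;
    bool v = (nzcv & 1) != 0;

    switch (cond) {
    case COND_EQ: return z;
    case COND_NE: return !z;
    case COND_CS: return c;
    case COND_CC: return !c;
    case COND_MI: return n;
    case COND_PL: return !n;
    case COND_VS: return v;
    case COND_VC: return !v;
    case COND_HI: return c && !z;
    case COND_LS: return !(c && !z);
    case COND_GE: return n == v;
    case COND_LT: return n != v;
    case COND_GT: return !z && n == v;
    case COND_LE: return !(!z && n == v);
    case COND_AL: return true;
    }
    return false; /* not reached for valid cond_code values */
}
```

After comparing 3 with 5, for example, only N is set (0x8), so lt holds and ge does not.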


The CPU execution modes


Called "Exception Levels"
4 in total, from EL0 to EL3
Stored in the process state register (PSTATE) and can be accessed via the CurrentEL "subregister"
The levels are intended to be used for:

EL3: Secure monitor

Highest execution mode
Has access to the Secure state and programs running in it
Basically no restrictions


EL2: Hypervisor

Only has access to Non-secure state
Responsible for switching between virtual machines, each defined as comprising non-secure EL1 and EL0
Has controls to trap various operations from the lower execution modes
Can set up a second stage MMU to map EL1 memory (IPA: Intermediate Physical Address) to real physical memory


EL1: OS kernel and associated functions that are typically described as privileged

Can be in either the Secure state or Non-secure state
Has access to everything, except the things (often transparently) restricted by the higher execution modes

e.g., set up MMU, IRQ handlers, control IO, etc


EL0 (unprivileged execution): Applications

Has the least privileges of all the levels
Cannot modify anything other than its own or shared writeable memory, and general purpose registers
All IO and other privileged operations must be performed through a system call to EL1


Each execution mode has its own dedicated stack pointer, and EL1 to EL3 each also have their own saved process state (SPSR_ELx) and expected return address (ELR_ELx) registers

Automatically switches to the dedicated registers on an execution mode change
Each mode can still access the less privileged modes' registers, however


Interrupts

AVR

Only way to interrupt execution
Only a single priority of IRQs
Each interrupt can be masked by setting/clearing certain bits in certain IO registers
Interrupts can be enabled/disabled globally by setting/clearing the I flag in the status register
Interrupt handlers run under the same context as the currently executing code, except a return address is pushed onto the stack
Interrupts are disabled globally upon entering the handler, and re-enabled upon exiting via reti
No software triggered interrupts

Except writing to an output pin where an interrupt is configured on the pin


ARM

Interrupts are only one of several ways to interrupt execution, all of which are known as exceptions in the ARM world
Two priorities of IRQs: IRQ and FIQ

FIQs are intended to be higher priority than IRQs, but the architecture explicitly leaves this implementation-defined


Each interrupt can be masked by setting/clearing certain bits in certain IO registers
IRQs and FIQs can be enabled/disabled globally by clearing/setting the I and F flags in DAIF of PSTATE respectively

However, when the target exception level is higher than the current level:

If the target level is EL2 or EL3, the interrupts cannot be masked
If the target level is EL1, the interrupts can be masked, but DAIF cannot be modified by default under EL0


Interrupts of a lower target exception level remain pending if current execution is in a higher level
The following steps are taken when entering an interrupt:

PSTATE is saved to the target exception level's SPSR_ELx special purpose register
The preferred return address is saved to the ELR_ELx special purpose register
All bits in DAIF are set
Execution moves to the target exception level
The exception level's dedicated stack pointer register is selected for use


The following steps are taken when returning from an interrupt with eret:

PC is restored with the contents in ELR_ELx
PSTATE is restored with the contents in SPSR_ELx

Since steps 3-5 of entering the interrupt are just modifying PSTATE, this step automatically reverts them


No software triggered interrupts
Software triggered exceptions, however, do exist. For example:

System calls via the svc (EL0 → EL1), hvc (EL1 → EL2) and smc (EL1 or EL2 → EL3) instructions
Trapped/privileged instructions (e.g., by the hypervisor)
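
The interrupt entry and return steps listed above can be sketched as a toy C model. Everything here is a simplification for illustration: the types `proc_state` and `cpu_model`, the functions `take_interrupt`, `eret` and `current_sp`, and the `vector` parameter are hypothetical, DAIF is compacted into four bits, and each level is assumed to always use its own stack pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Compact model of the four DAIF mask bits. */
enum { DAIF_F = 1u << 0, DAIF_I = 1u << 1, DAIF_A = 1u << 2, DAIF_D = 1u << 3 };

typedef struct {
    unsigned nzcv;       /* condition flags */
    unsigned daif;       /* interrupt mask bits */
    unsigned current_el; /* 0 to 3 */
} proc_state;

typedef struct {
    uint64_t   pc;
    proc_state pstate;
    uint64_t   sp_el[4];   /* one stack pointer per exception level */
    uint64_t   elr_el[4];  /* index 0 unused: EL0 never takes exceptions */
    proc_state spsr_el[4]; /* index 0 unused */
} cpu_model;

void take_interrupt(cpu_model *cpu, unsigned target_el, uint64_t vector)
{
    assert(target_el >= 1 && target_el <= 3);
    assert(target_el >= cpu->pstate.current_el);

    cpu->spsr_el[target_el] = cpu->pstate;                /* step 1 */
    cpu->elr_el[target_el] = cpu->pc;                     /* step 2 */
    cpu->pstate.daif = DAIF_D | DAIF_A | DAIF_I | DAIF_F; /* step 3 */
    cpu->pstate.current_el = target_el;                   /* step 4 */
    /* Step 5 is implicit here: current_sp() follows current_el. */
    cpu->pc = vector;
}

void eret(cpu_model *cpu)
{
    unsigned el = cpu->pstate.current_el;
    assert(el >= 1);

    cpu->pc = cpu->elr_el[el];
    cpu->pstate = cpu->spsr_el[el]; /* undoes steps 3 to 5 in one go */
}

uint64_t *current_sp(cpu_model *cpu)
{
    return &cpu->sp_el[cpu->pstate.current_el];
}
```

Because the mask bits, exception level and stack pointer selection all live in the modelled PSTATE, restoring it from SPSR_ELx is enough to return to the interrupted context.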


Instruction Set

AVR

16-bit instructions, with a few 32-bit exceptions (e.g., jmp, call, lds and sts)
Only accepts one or two operands each

Typically destination register is also one of the sources


No register-indexed addressing
No PC-relative addressing for load/store

Only relative jumps and calls


No conditional execution except for branching
Stack pointer is a dedicated IO register

Can only be read and written with in and out
Only push and pop instructions can access memory via the stack pointer
Must use a frame pointer for local stack variables


No barrel shifter, so each shift instruction can only shift left or right by a single bit
Has in and out instructions for IO
Has sleep for entering power saving modes
Has a built-in watchdog, which is reset with wdr

ARM

32-bit fixed length instructions
Encoded into certain categories:

xxx00...: Unallocated
xxx100...: Data processing - immediate
xxx101...: Branch, exception generation and system instructions
xxxx1x0...: Loads and stores
xxxx101...: Data processing - register
xxxx111...: Data processing - SIMD and floating point
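
Reading the patterns above as starting at bit 31, the category is decided by bits 28 to 25 of the instruction word. A possible C decoder for just this top-level split is sketched below. The names `a64_group` and `a64_classify` are invented for the example, and it follows the table as given here rather than any later revision of the architecture.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    A64_UNALLOCATED,
    A64_DP_IMMEDIATE,
    A64_BRANCH_EXC_SYS,
    A64_LOAD_STORE,
    A64_DP_REGISTER,
    A64_DP_SIMD_FP
} a64_group;

/* Classify a 32-bit instruction word by bits 28:25 only. */
a64_group a64_classify(uint32_t insn)
{
    unsigned op0 = (insn >> 25) & 0xf;

    if ((op0 & 0xc) == 0x0) return A64_UNALLOCATED;    /* 00xx */
    if ((op0 & 0xe) == 0x8) return A64_DP_IMMEDIATE;   /* 100x */
    if ((op0 & 0xe) == 0xa) return A64_BRANCH_EXC_SYS; /* 101x */
    if ((op0 & 0x5) == 0x4) return A64_LOAD_STORE;     /* x1x0 */
    if ((op0 & 0x7) == 0x5) return A64_DP_REGISTER;    /* x101 */

    assert((op0 & 0x7) == 0x7);                        /* x111 */
    return A64_DP_SIMD_FP;
}
```

For example, 0xd503201f (nop) falls under branch, exception generation and system instructions, while 0x8b020020 (add x0, x1, x2) is a register data processing instruction.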


Usually accepts three or four operands each

Destination register can be different to source registers


Register-indexed addressing available for load/store

Most frequently used load and store instructions accept an optional register offset


PC-relative addressing available for load/execute

Literal load and branch instructions accept a PC-relative offset, but there is no PC-relative store
adr calculates the absolute address from a PC-relative one and puts it in a register when needed for other purposes, such as a store


Conditional execution of certain instructions

See section Registers for what instructions and conditions are available
The condition is accepted as a special operand
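If the condition is false, conditional select writes its second source (optionally incremented, inverted or negated) instead, and conditional compare sets the flags to an immediate value, so the outcome is always well defined while avoiding branching overheads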
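
The behaviour of these instructions can be modelled in C as below. This is a sketch of their semantics only, with function names borrowed from the mnemonics; a C compiler is free to turn the ternaries into branches. `ccmp` takes the flags a real comparison would have produced as a precomputed 4-bit value, which is a simplification made for this example, and `abs_branchless` is a hypothetical usage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Conditional select family: the first source if the condition holds,
 * otherwise a (possibly modified) second source. Never a no-op. */
uint64_t csel(bool cond, uint64_t xn, uint64_t xm)  { return cond ? xn : xm; }
uint64_t csinc(bool cond, uint64_t xn, uint64_t xm) { return cond ? xn : xm + 1; }
uint64_t csinv(bool cond, uint64_t xn, uint64_t xm) { return cond ? xn : ~xm; }
uint64_t csneg(bool cond, uint64_t xn, uint64_t xm) { return cond ? xn : 0 - xm; }

/* Conditional compare: the real comparison flags if the condition holds,
 * otherwise the 4-bit immediate supplied in the instruction. */
unsigned ccmp(bool cond, unsigned compare_nzcv, unsigned imm_nzcv)
{
    assert(compare_nzcv <= 0xf && imm_nzcv <= 0xf);
    return cond ? compare_nzcv : imm_nzcv;
}

/* Usage: absolute value as a compare followed by a csneg on "ge". */
int64_t abs_branchless(int64_t x)
{
    return (int64_t)csneg(x >= 0, (uint64_t)x, (uint64_t)x);
}
```

Chaining `ccmp` is how compound conditions such as `a == b && c == d` can be evaluated without a branch between the two comparisons.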


Stack pointer (sp) is a special purpose register, restricted to a few instructions:

No dedicated push/pop instructions
Load/Store instructions can use it as the base address

Push/Pop can be done with these instructions with post/pre-indexing


Add/Subtract instructions can use it as a source operand, destination, or both

Copying it to another register is done by adding sp with #0 (which is aliased by a special mov instruction)


Logical instructions in immediate form accept it as a destination operand
Possible to not use a frame pointer at all with this design


Has a barrel shifter, so multiple left and right shifts can be done in a single instruction without any performance penalty
All IO is done exclusively through memory mapped IO (Device memory. See Memory Model)

Thus, IO can only be done by the load and store instructions, as if it were normal memory


Has wfe and wfi for entering power saving modes
Does not require a built-in watchdog

If a watchdog is added, however, resetting would be through an implementation-defined IO write


FP, DSP, SIMD and cryptography instructions are available

Data Types

Note: Considering only the fundamental C types
AVR

Word size is 8 bits

Minor exception for register pairs r31:r30, r29:r28, r27:r26 and r25:r24 where 16-bit addition and subtraction by an immediate value from 0 to 63 can be done in a single instruction


Two's complement is used for signed integers
No FPU, so floating point types have to be emulated
Implements only [u]int8_t natively. Other types are emulated

ARM

Word size is 64 bits
Two's complement is used for signed integers
SIMD FPU is available, with half, single and double precision floating-point numbers supported
All fundamental C types can be implemented natively, and more
128-bit vectors and vector operations are available both in integer and floating-point mode

All fundamental C types can be inserted into the vectors


Implementing 64-bit integers

Since 64-bit integers are supported natively in AArch64, I will instead implement this in AArch32, where 64-bit integers aren't supported natively, in the spirit of the question.
Note: I would have done 128-bit integers if AArch64 had a 64-bit version of umull, but it doesn't, and writing one is not fun.
Addition and Subtraction

Addition can easily be done with the adc instruction
For example (Assuming numbers are in r1:r0 and r3:r2, and output in r5:r4):
adds r4, r0, r2 // low words; adds (not add) sets the carry flag
adc r5, r1, r3  // high words plus the carry

Subtraction can use the same code as above, with the number to subtract first negated using two's complement.

Alternatively, the following can be used:
subs r4, r0, r2 // low words; subs (not sub) clears the carry flag on borrow
sbc r5, r1, r3  // high words minus the borrow
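
To show the carry handoff between the two instructions, here is a C model that builds 64-bit addition and subtraction out of 32-bit halves. It is a sketch with invented names (`u64_pair`, `split64`, `join64`, `add64`, `sub64`, `check_add_sub`); the `carry` variable stands in for the C flag, which only carries over if the low-word instruction is the flag-setting form.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t lo, hi; } u64_pair;

u64_pair split64(uint64_t v)
{
    u64_pair p = { (uint32_t)v, (uint32_t)(v >> 32) };
    return p;
}

uint64_t join64(u64_pair p) { return ((uint64_t)p.hi << 32) | p.lo; }

u64_pair add64(u64_pair x, u64_pair y)
{
    u64_pair r;
    r.lo = x.lo + y.lo;            /* low-word add */
    uint32_t carry = r.lo < x.lo;  /* C flag: set on unsigned wraparound */
    r.hi = x.hi + y.hi + carry;    /* adc */
    return r;
}

u64_pair sub64(u64_pair x, u64_pair y)
{
    u64_pair r;
    r.lo = x.lo - y.lo;            /* low-word subtract */
    uint32_t carry = x.lo >= y.lo; /* C flag: set when there is no borrow */
    r.hi = x.hi + ~y.hi + carry;   /* sbc: x.hi - y.hi - (1 - C) */
    return r;
}

/* Compare both operations against native 64-bit arithmetic. */
void check_add_sub(uint64_t a, uint64_t b)
{
    assert(join64(add64(split64(a), split64(b))) == a + b);
    assert(join64(sub64(split64(a), split64(b))) == a - b);
}
```

A call such as `check_add_sub(0xffffffff, 1)` exercises the case where the low word wraps and the carry has to reach the high word.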

Multiplication

If inputs are a:b = (a * 2^32 + b) and c:d = (c * 2^32 + d), the low 64 bits of the product will be (b * d) + (a * d + b * c) * 2^32.

That means b * d needs to return a 64-bit result, which the umull instruction does, and have a * d + b * c added to its upper 32 bits.
For example (Assuming numbers are in r1:r0 and r3:r2, and output in r5:r4):
umull r4, r5, r0, r2 // high:low = b * d
mla r5, r1, r2, r5   // high = a * d + high
mla r5, r0, r3, r5   // high = b * c + high
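
The same three steps can be mirrored in C to check them against native 64-bit multiplication. This is only a sketch with made-up names (`mul64_from_halves`, `check_mul`). The a * c term is absent because it is multiplied by 2^64 and so never reaches the low 64 bits, and the two cross products are allowed to wrap modulo 2^32 for the same reason.

```c
#include <assert.h>
#include <stdint.h>

/* Low 64 bits of (a:b) * (c:d), where a and c are the high words. */
uint64_t mul64_from_halves(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    uint64_t bd = (uint64_t)b * d;       /* umull: full 64-bit product */
    uint32_t lo = (uint32_t)bd;
    uint32_t hi = (uint32_t)(bd >> 32);

    hi += a * d;                         /* mla: wraps modulo 2^32 */
    hi += b * c;                         /* mla: wraps modulo 2^32 */

    return ((uint64_t)hi << 32) | lo;
}

/* Compare against the native 64-bit product, which wraps the same way. */
void check_mul(uint64_t x, uint64_t y)
{
    uint64_t got = mul64_from_halves((uint32_t)(x >> 32), (uint32_t)x,
                                     (uint32_t)(y >> 32), (uint32_t)y);
    assert(got == x * y);
}
```

For example, multiplying 0xffffffff by itself gives 0xfffffffe00000001, which only comes out right because the first step keeps all 64 bits of b * d.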

Personal Reflection

Before starting this assignment, I knew very little about ARM, and being asked to compare it to AVR made me assume the two would be fairly similar.

However, after finishing this assignment, I have discovered that ARM is a lot more similar to x86-64 than to AVR, which is why it is commonly said to be a competitor to Intel.
For example, ARM would be complete overkill for simple mechatronics projects where interfacing with inputs and outputs at (relatively) low speeds without much data processing is all that is needed, which is why AVR is often used for these projects.
Where ARM shines, however, is in its current main use in smartphones, where privilege separation and multitasking are important, and where the tens-of-MHz speeds of AVR are definitely not enough.

Other possible use cases would be DSP applications (e.g., telecommunications), possibly by itself or as a coprocessor to an FPGA or ASIC, to handle large amounts of data at high speeds.
The lack of a built-in watchdog for ARM (arguably the only feature AVR has that ARM doesn't) likely points to the fact that ARM was designed for a more general purpose use rather than for embedded systems.
Conclusion

Don't use ARM if you:

Need very low power consumption

Due to all the features and speed that ARM packs in, it is unlikely to consume less power than the simpler AVR


Only do DSP, FP or cryptography operations infrequently

Putting up with slow but infrequent DSP/FP/crypto on AVR is usually far cheaper than switching to ARM


and have no need for:

Heavy or large number crunching or high performance

AVR only supports up to 8-bit integers natively, and no SIMD


High processor frequencies

AVR only goes up to tens of MHz (16 MHz for the ATmega2560), while ARM can go up to a few GHz


Privilege separation of running code

AVR has only a single CPU execution mode


Multitasking

Otherwise, ARM may be a possibility for your needs
References

Wikipedia, 2015, ARM architecture. 2015. Available from: http://en.wikipedia.org/wiki/ARM_architecture

Grisenthwaite, Richard, 2011, ARMv8 Technology Preview. Presentation. 2011.

ARM Architecture Reference Manual; ARMv8, for ARMv8-A architecture profile, 2013. A.a. ARM Holdings.

Wikipedia, 2015, ARM Cortex-A57. 2015. Available from: http://en.wikipedia.org/wiki/ARM_Cortex-A57

ARM Holdings, 2014, Cortex-A57 Processor - ARM. 2014. Available from: http://www.arm.com/products/processors/cortex-a/cortex-a57-processor.php

ARM® Cortex®-A57 MPCore Processor Technical Reference Manual, 2014. r1p3. ARM Holdings.

Atmel ATmega640/V-1280/V-1281/V-2560/V-2561/V, 2014. 2549Q-02/2014. Atmel Corporation.