ARM compared with AVR


This document is based on the ARMv8-A architecture, mainly focusing on AArch64, and compares it to the ATmega2560's megaAVR architecture.
Thus "ARM" is used to mean "ARMv8-A AArch64" and "AVR" to mean "megaAVR".

Microarchitectural details for ARM are taken from the Cortex-A57.


Focusing on the differences between ARM and AVR, ARM has:

  • Deeply out-of-order, superscalar, pipelined execution
    • Up to 128 instructions can be in some state of execution simultaneously
  • Dynamic branch prediction
  • Multicore support
  • 32-bit fixed length RISC instruction set
  • Support for register indexed and PC-relative addressing
  • Conditional select and compare
  • FP, DSP, SIMD and cryptography instructions support
  • Hardware virtualisation support
  • Multiple CPU execution modes for privilege separation
  • 31 64-bit general purpose registers
  • Up to 48 bits of addressable virtual and physical memory
    • Top 8 bits of the 64-bit address can be configured for pointer tagging
  • L1 and L2 CPU caches

Memory Model


AVR

  • Modified Harvard architecture
    • Modifications are special instructions for reading and writing to program memory
  • Program memory is word (16-bit) addressed
  • Data memory is byte (8-bit) addressed
  • Little endian
  • 17 bits of addressable program memory for a maximum of 256 KiB
  • 16 bits of addressable data memory for a maximum of 64 KiB
  • First 512 bytes of data memory are always mapped to registers:
    • 0x00 -> 0x1f: General purpose registers
    • 0x20 -> 0x5f: Standard IO registers
    • 0x60 -> 0x1ff: Extended IO registers
  • No MMU
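
The data memory map above can be sketched as a small classifier. This is only an illustration: the type and function names are hypothetical, and the offset of 0x20 between the IO numbers used by in/out and the data-space addresses is taken from the ATmega datasheet rather than from the list above.

```c
#include <stdint.h>

/* Regions of the ATmega2560 data address space, as listed above. */
typedef enum {
    AVR_GP_REGISTER,  /* 0x0000 -> 0x001f */
    AVR_STANDARD_IO,  /* 0x0020 -> 0x005f */
    AVR_EXTENDED_IO,  /* 0x0060 -> 0x01ff */
    AVR_DATA_MEMORY   /* 0x0200 and above */
} avr_region;

/* Hypothetical helper: classify a 16-bit data address. */
avr_region avr_classify(uint16_t addr)
{
    if (addr <= 0x001f) return AVR_GP_REGISTER;
    if (addr <= 0x005f) return AVR_STANDARD_IO;
    if (addr <= 0x01ff) return AVR_EXTENDED_IO;
    return AVR_DATA_MEMORY;
}

/* in/out number the standard IO registers from 0, so the matching
   data-space address is the IO number plus 0x20. */
uint16_t avr_io_to_data_addr(uint8_t io_number)
{
    return (uint16_t)(io_number + 0x20);
}
```

For instance, the status register is IO number 0x3f for in/out, which this sketch maps to data address 0x5f, the last standard IO register.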


ARM

  • Combination of Von Neumann and Harvard architecture
    • Shared memory (L3) and L2 cache for both program and data memory (Von Neumann)
    • L1 cache is split into instruction (program) and data caches (Harvard)
    • L3 memory is used for all physical addressing (Von Neumann)
  • Memory is byte (8-bit) addressed
    • Instructions must be aligned to 32 bits, or a PC alignment fault is raised
    • Unaligned device memory load/store raises an Alignment fault
    • Unaligned normal memory load/store, however, is usually possible but slow, and only if the processor is configured to allow it
  • Little endian instructions, but memory access can be configured to be either big or little endian
  • 48 bits of addressable physical memory for a maximum of 256 TiB
  • Physical memory is split into:
    • Normal memory
    • Device memory
      • Defined as memory where multiple reads can return different values and writes can cause side effects
      • IO happens here (i.e., memory mapped)
      • Registers are not mapped to memory except...
      • GIC (Generic Interrupt Controller) registers and debug registers are mapped here
  • Has an MMU
    • 64-bit addresses
      • Top 8 bits can be configured for pointer tagging, and will be ignored by the MMU if so
    • 49 bits of addressable virtual memory for a maximum of 512 TiB
      • Split into two 48-bit subranges positioned at the bottom and top of the 64-bit address space
      • Top subrange typically used for kernel while bottom for applications
    • Configurable page granularity of 4, 16 or 64 KiB
    • Read/Write/Execute permissions are changeable
    • Further split between "Non-secure state" and "Secure state", where Secure memory is only accessible from the Secure state and switching between the two goes through EL3
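
The two 48-bit virtual subranges and the optional tag byte can be modelled in a few lines of C. This is a rough sketch with hypothetical names. It assumes 48-bit subranges, and it assumes that when tagging is on, bit 55 decides which subrange an address belongs to; that detail comes from the architecture manual, not from the list above.

```c
#include <stdint.h>

typedef enum { VA_LOW_RANGE, VA_HIGH_RANGE, VA_INVALID } va_range;

/* Classify an untagged 64-bit virtual address: bits 63..48 must be
   all zeros (bottom subrange) or all ones (top subrange). */
va_range va_classify(uint64_t va)
{
    uint64_t top = va >> 48;
    if (top == 0x0000) return VA_LOW_RANGE;
    if (top == 0xffff) return VA_HIGH_RANGE;
    return VA_INVALID;
}

/* Model "the MMU ignores the top 8 bits": replace the tag byte with
   copies of bit 55 so the result can be classified as above. */
uint64_t va_strip_tag(uint64_t va)
{
    uint64_t low56 = va & 0x00ffffffffffffffULL;
    if (va & (1ULL << 55))
        return low56 | 0xff00000000000000ULL;
    return low56;
}
```

An application pointer carrying the tag 0xab in its top byte still lands in the bottom subrange once the tag is stripped, which is what makes pointer tagging usable without changing the page tables.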



Registers

AVR

  • 32 8-bit wide general purpose registers
  • 1 17-bit wide program counter (enough to address all 128 Ki words of program memory)
  • 480 8-bit wide IO registers (some pair up to form a single 16-bit register, e.g., the stack pointer)
    • Status register is here
    • Registers that modify the behaviour of the CPU
    • IO to peripherals (USART, SPI, GPIO, etc.)
    • Other things that cause side effects


ARM

  • 31 64-bit wide general purpose registers
  • 1 64-bit wide zero register
  • 1 64-bit wide program counter
  • 22 64-bit or 32-bit wide special purpose registers
    • Registers that hold and save state for the different CPU execution modes
    • Process state register is here (PSTATE). Some notable "subregisters" include:
      • Condition Flags (NZCV)
      • Interrupt Mask Bits (DAIF)
      • Current Exception Level (CurrentEL)
    • Includes 4 64-bit wide stack pointers (one for each CPU execution mode)
  • Numerous 64-bit and 32-bit wide system registers
    • Registers that provide control and status information of CPU features

How the condition flags (NZCV) are used

  • N-bit: Negative condition flag
    • Set to the MSB of the last flag-setting instruction's result
    • If the result is interpreted as a two's complement signed integer, the result was negative if set
  • Z-bit: Zero condition flag
    • Set if the last flag-setting instruction's result was zero
  • C-bit: Carry condition flag
    • Set if the last flag-setting instruction resulted in a carry condition (e.g., unsigned overflow)
  • V-bit: Overflow condition flag
    • Set if the last flag-setting instruction resulted in an overflow condition (e.g., signed overflow)
  • Typically the integer and floating-point arithmetic and compare/test instructions set/clear these flags
  • These categories of instructions use the flags as input:
    • Conditional branch (i.e., b.cond)
    • Add or subtract with carry (i.e., adc, adcs, sbc, sbcs)
    • Conditional select (i.e., csinc, csneg, csinv, csel, cset, csetm)
    • Conditional compare (i.e., ccmp, ccmn)
  • The following condition codes are available; their meanings are given for integer comparisons:
    • eq/ne: Equal/Not equal (Z == 1/0)
    • cs/cc: Carry set/Carry clear (C == 1/0)
    • mi/pl: Minus/Plus or zero (N == 1/0)
    • vs/vc: Overflow/No overflow (V == 1/0)
    • hi: Unsigned higher (C == 1 && Z == 0)
    • ls: Unsigned lower or same (!(C == 1 && Z == 0))
    • ge: Signed greater than or equal (N == V)
    • lt: Signed less than (N != V)
    • gt: Signed greater than (Z == 0 && N == V)
    • le: Signed less than or equal (!(Z == 0 && N == V))
    • al: Always
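
To make the flag definitions concrete, the following C sketch computes NZCV the way a 64-bit compare (a subtraction whose result is discarded) would, and then evaluates each condition code against those flags. The type and function names are hypothetical; only the flag rules come from the lists above. Note that for a subtraction the carry flag is set when there is no borrow.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct { bool n, z, c, v; } nzcv;

typedef enum {
    COND_EQ, COND_NE, COND_CS, COND_CC, COND_MI, COND_PL, COND_VS, COND_VC,
    COND_HI, COND_LS, COND_GE, COND_LT, COND_GT, COND_LE, COND_AL
} cond;

/* Flags produced by "cmp a, b", i.e. by computing a - b. */
nzcv flags_from_cmp(uint64_t a, uint64_t b)
{
    uint64_t r = a - b;
    nzcv f;
    f.n = (r >> 63) & 1;                     /* MSB of the result */
    f.z = (r == 0);
    f.c = (a >= b);                          /* set when there is no borrow */
    f.v = (((a ^ b) & (a ^ r)) >> 63) & 1;   /* signed overflow */
    return f;
}

/* Evaluate a condition code against a set of flags. */
bool cond_holds(cond code, nzcv f)
{
    switch (code) {
    case COND_EQ: return f.z;
    case COND_NE: return !f.z;
    case COND_CS: return f.c;
    case COND_CC: return !f.c;
    case COND_MI: return f.n;
    case COND_PL: return !f.n;
    case COND_VS: return f.v;
    case COND_VC: return !f.v;
    case COND_HI: return f.c && !f.z;
    case COND_LS: return !(f.c && !f.z);
    case COND_GE: return f.n == f.v;
    case COND_LT: return f.n != f.v;
    case COND_GT: return !f.z && f.n == f.v;
    case COND_LE: return !(!f.z && f.n == f.v);
    case COND_AL: return true;
    }
    return true;
}
```

Comparing 0xffffffffffffffff with 1 shows why separate signed and unsigned codes exist: hi holds (the value is a huge unsigned number) and lt holds as well (the same bits are -1 when read as signed).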

The CPU execution modes

  • Called "Exception Levels"
  • 4 in total, from EL0 to EL3
  • Stored in the process state register (PSTATE) and can be accessed via the CurrentEL "subregister"
  • The levels are intended to be used for:
    • EL3: Secure monitor
      • Highest execution mode
      • Has access to the Secure state and programs running in it
      • Basically no restrictions
    • EL2: Hypervisor
      • Only has access to Non-secure state
      • Responsible for switching between virtual machines, each comprising non-secure EL1 and EL0
      • Has controls to trap various operations from the lower execution modes
      • Can set up a second stage MMU to map EL1 memory (IPA: Intermediate Physical Address) to real physical memory
    • EL1: OS kernel and associated functions that are typically described as privileged
      • Can be in either the Secure state or Non-secure state
      • Has access to everything, except the things (often transparently) restricted by the higher execution modes
        • e.g., set up MMU, IRQ handlers, control IO, etc
    • EL0 (unprivileged execution): Applications
      • Has the least privileges of all the levels
      • Cannot modify anything other than its own or shared writeable memory, and general purpose registers
      • All IO and other privileged operations must be performed through a system call to EL1
  • Each execution mode has its own dedicated stack pointer, process state (PSTATE) and expected return address registers
    • Automatically switches to the dedicated registers on an execution mode change
    • Each mode can still access the less privileged modes' registers, however



Interrupts

AVR

  • Only way to interrupt execution
  • Only a single priority of IRQs
  • Each interrupt can be masked by setting/clearing certain bits in certain IO registers
  • Interrupts can be enabled/disabled globally by setting/clearing the I flag in the status register
  • Interrupt handlers run under the same context as the currently executing code, except a return address is pushed onto the stack
  • Interrupts are disabled globally upon entering the handler, and re-enabled upon exiting via reti
  • No software triggered interrupts
    • Except writing to an output pin where an interrupt is configured on the pin


ARM

  • Only one of several ways to interrupt execution, which are collectively known as exceptions in the ARM world
  • Two priorities of IRQs: IRQ and FIQ
    • FIQs are intended to be higher priority than IRQs, but the standard explicitly leaves this implementation-defined
  • Each interrupt can be masked by setting/clearing certain bits in certain IO registers
  • IRQs and FIQs can be enabled/disabled globally by clearing/setting the I and F flags in DAIF of PSTATE respectively
    • However, when the target exception level is higher than the current level:
      • If the target level is EL2 or EL3, the interrupts cannot be masked
      • If the target level is EL1, the interrupts can be masked
        • but DAIF cannot be modified by default under EL0
  • Interrupts of a lower target exception level remain pending if current execution is in a higher level
  • The following steps are taken when entering an interrupt:
    1. PSTATE is saved to the target exception level's SPSR_ELx special purpose register
    2. The preferred return address is saved to the ELR_ELx special purpose register
    3. All bits in DAIF are set
    4. Execution moves to the target exception level
    5. The exception level's dedicated stack pointer register is selected for use
  • The following steps are taken when returning from an interrupt with eret:
    1. PC is restored with the contents in ELR_ELx
    2. PSTATE is restored with the contents in SPSR_ELx
      • Since steps 3-5 of entering the interrupt are just modifying PSTATE, this step automatically reverts them
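
The entry and return steps above can be traced with a toy model in C. This is a sketch, not a description of real hardware state: the struct layout and names are hypothetical, PSTATE is reduced to three fields, and the handler address is passed in as a parameter instead of being looked up in a vector table.

```c
#include <stdint.h>

/* A heavily reduced PSTATE. */
typedef struct {
    uint8_t nzcv;  /* condition flags */
    uint8_t daif;  /* interrupt mask bits; 0xf means all set */
    uint8_t el;    /* current exception level, 0..3 */
} pstate;

typedef struct {
    uint64_t pc;
    pstate   ps;
    uint64_t elr[4];   /* ELR_ELx, index 0 unused */
    pstate   spsr[4];  /* SPSR_ELx, index 0 unused */
} cpu;

/* Entering an interrupt that targets exception level target_el. */
void take_interrupt(cpu *c, unsigned target_el, uint64_t handler)
{
    c->spsr[target_el] = c->ps;        /* 1. save PSTATE */
    c->elr[target_el]  = c->pc;        /* 2. save the preferred return address */
    c->ps.daif = 0xf;                  /* 3. set all bits in DAIF */
    c->ps.el   = (uint8_t)target_el;   /* 4. move to the target level */
    /* 5. in this model the stack pointer choice follows from ps.el */
    c->pc = handler;
}

/* Returning with eret: restore PC and PSTATE from the current level. */
void eret(cpu *c)
{
    unsigned el = c->ps.el;
    c->pc = c->elr[el];
    c->ps = c->spsr[el];
}
```

Because the exception level and the DAIF bits both live in the modelled PSTATE, restoring it in eret is enough to drop back to the interrupted level with interrupts unmasked again.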
  • No software triggered interrupts
  • Software triggered exceptions, however, do exist. For example:
    • System calls via the svc (EL0 -> EL1), hvc (EL1 -> EL2) and smc (EL2 -> EL3) instructions
    • Trapped/privileged instructions (e.g., by the hypervisor)

Instruction Set


AVR

  • 16-bit fixed length instructions
  • Only accepts one or two operands each
    • Typically destination register is also one of the sources
  • No register-indexed addressing
  • No PC-relative addressing for load/store
    • Only relative jumps and calls
  • No conditional execution except for branching
  • Stack pointer is a dedicated IO register
    • Can only be read and written with in and out
    • Only push and pop instructions can access memory via the stack pointer
    • Must use a frame pointer for local stack variables
  • No barrel shifter, so can only perform single left or right shifts
  • Has in and out instructions for IO
  • Has sleep for entering power saving modes
  • Has a built-in watchdog, which is reset with wdr


ARM

  • 32-bit fixed length instructions
  • Encoded into certain categories:
    • xxx00...: Unallocated
    • xxx100...: Data processing - immediate
    • xxx101...: Branch, exception generation and system instructions
    • xxxx1x0...: Loads and stores
    • xxxx101...: Data processing - register
    • xxxx111...: Data processing - SIMD and floating point
  • Usually accepts three or four operands each
    • Destination register can be different to source registers
  • Register-indexed addressing available for load/store
    • Most frequently used load and store instructions accept an optional register offset
  • PC-relative addressing available for load/store/execute
    • Most frequently used load, store and branch instructions accept a PC-relative offset
    • adr calculates the absolute address from a PC-relative one and puts it in a register when needed for other purposes
  • Conditional operation for certain instructions
    • See section Registers for what instructions and conditions are available
    • The condition is accepted as a special operand
    • The instruction always executes, but the condition selects between two results (e.g., csel picks one of two source registers), avoiding branching overheads
  • Stack pointer (sp) is a special purpose register, restricted to a few instructions:
    • No dedicated push/pop instructions
    • Load/Store instructions can use it as the base address
      • Push/Pop can be done with these instructions with post/pre-indexing
    • Add/Subtract instructions can use it as a source operand, destination, or both
      • Copying it to another register is done by adding sp with #0 (which is aliased by a special mov instruction)
    • Logical instructions in immediate form accept it as a destination operand
    • Possible to not use a frame pointer at all with this design
  • Has a barrel shifter, so multiple left and right shifts can be done in a single instruction without any performance penalty
  • All IO is done exclusively through memory mapped IO (Device memory. See Memory Model)
    • Thus, IO can only be done by the load and store instructions, as if it were normal memory
  • Has wfe and wfi for entering power saving modes
  • Does not require a built-in watchdog
    • If a watchdog is added, however, resetting would be through an implementation-defined IO write
  • FP, DSP, SIMD and cryptography instructions are available
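
The conditional select family mentioned in this list can be summarised in C. Each function below models what one instruction writes to its destination register, given the outcome of the condition; the function names simply mirror the mnemonics, and umax64 is a hypothetical example of how a compiler might use csel to avoid a branch.

```c
#include <stdint.h>
#include <stdbool.h>

/* csel  Xd, Xn, Xm, cond : Xd = cond ? Xn : Xm */
uint64_t csel(bool cond, uint64_t xn, uint64_t xm)  { return cond ? xn : xm; }

/* csinc Xd, Xn, Xm, cond : Xd = cond ? Xn : Xm + 1 */
uint64_t csinc(bool cond, uint64_t xn, uint64_t xm) { return cond ? xn : xm + 1; }

/* csinv Xd, Xn, Xm, cond : Xd = cond ? Xn : ~Xm */
uint64_t csinv(bool cond, uint64_t xn, uint64_t xm) { return cond ? xn : ~xm; }

/* csneg Xd, Xn, Xm, cond : Xd = cond ? Xn : -Xm */
uint64_t csneg(bool cond, uint64_t xn, uint64_t xm) { return cond ? xn : 0 - xm; }

/* cset  Xd, cond : Xd = cond ? 1 : 0 */
uint64_t cset(bool cond)  { return cond ? 1 : 0; }

/* csetm Xd, cond : Xd = cond ? all ones : 0 */
uint64_t csetm(bool cond) { return cond ? ~(uint64_t)0 : 0; }

/* Unsigned maximum without a branch: "cmp a, b" followed by
   "csel result, a, b, hi". */
uint64_t umax64(uint64_t a, uint64_t b)
{
    bool hi = a > b;  /* what the hi condition reports after cmp a, b */
    return csel(hi, a, b);
}
```

In every case the instruction produces a result whichever way the condition goes, so no branch is needed.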

Data Types

Note: Considering only the fundamental C types


AVR

  • Word size is 8 bits
    • Minor exception for register pairs r31:r30, r29:r28, r27:r26 and r25:r24 where 16-bit addition and subtraction by an immediate value from 0 to 63 can be done in a single instruction
  • Two's complement is used for signed integers
  • No FPU, so floating point types have to be emulated
  • Implements only [u]int8_t natively. Other types are emulated
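
As an illustration of what "emulated" means here, a 16-bit addition on an 8-bit machine takes two 8-bit additions chained through the carry flag (add for the low bytes, adc for the high bytes on AVR). The C sketch below models this; the helper names are hypothetical.

```c
#include <stdint.h>

typedef struct { uint8_t value; uint8_t carry; } add8_result;

/* One 8-bit addition with carry-in and carry-out, like add/adc. */
static add8_result add8(uint8_t a, uint8_t b, uint8_t carry_in)
{
    unsigned sum = (unsigned)a + b + carry_in;
    add8_result r;
    r.value = (uint8_t)sum;         /* low 8 bits of the sum */
    r.carry = (uint8_t)(sum >> 8);  /* 1 if the sum did not fit in 8 bits */
    return r;
}

/* 16-bit addition built from two 8-bit additions. */
uint16_t add16_emulated(uint16_t x, uint16_t y)
{
    add8_result lo = add8((uint8_t)(x & 0xff), (uint8_t)(y & 0xff), 0);     /* add */
    add8_result hi = add8((uint8_t)(x >> 8), (uint8_t)(y >> 8), lo.carry);  /* adc */
    return (uint16_t)(((unsigned)hi.value << 8) | lo.value);
}
```

Wider types follow the same pattern with one extra adc per additional byte, which is why wide arithmetic costs several instructions on AVR.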


ARM

  • Word size is 64 bits
  • Two's complement is used for signed integers
  • SIMD FPU is available, with half, single and double precision floating-point numbers supported
  • All fundamental C types can be implemented natively, and more
  • 128-bit vectors and vector operations are available both in integer and floating-point mode
    • All fundamental C types can be inserted into the vectors

Implementing 64-bit integers

Since 64-bit integers are supported natively in AArch64, I will instead implement this in AArch32, where 64-bit integers aren't supported natively, in the spirit of the question.

Note: I would have done 128-bit integers if AArch64 had a 64-bit version of umull that returned the whole 128-bit product in one instruction, but it doesn't (the upper half needs a separate umulh), and writing one is not fun.

Addition and Subtraction

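Addition is done by adding the low words with adds, which sets the carry flag, and then adding the high words with adc, which adds that carry in.

For example (Assuming numbers are in r1:r0 and r3:r2, and output in r5:r4):

adds r4, r0, r2   // low words; sets the carry flag on overflow
adc r5, r1, r3    // high words, plus the carry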

Subtraction can reuse the code above by first negating the number to subtract (two's complement).
Alternatively, the following can be used:

subs r4, r0, r2   // low words; clears the carry flag on borrow
sbc r5, r1, r3    // high words, minus the borrow
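
The carry handling in both sequences can be checked with a C model that works on 32-bit halves only. This is a sketch with hypothetical names; the one assumption worth stating is the ARM convention that after a subtraction the carry flag is set when there was no borrow, so sbc subtracts the inverted carry.

```c
#include <stdint.h>

typedef struct { uint32_t lo, hi; } u64_pair;

u64_pair from_u64(uint64_t v)
{
    u64_pair p = { (uint32_t)v, (uint32_t)(v >> 32) };
    return p;
}

uint64_t to_u64(u64_pair p)
{
    return ((uint64_t)p.hi << 32) | p.lo;
}

/* adds + adc */
u64_pair add64(u64_pair x, u64_pair y)
{
    u64_pair r;
    r.lo = x.lo + y.lo;                 /* adds: low words */
    uint32_t carry = r.lo < x.lo;       /* C flag: set on unsigned overflow */
    r.hi = x.hi + y.hi + carry;         /* adc: high words plus carry */
    return r;
}

/* subs + sbc */
u64_pair sub64(u64_pair x, u64_pair y)
{
    u64_pair r;
    r.lo = x.lo - y.lo;                 /* subs: low words */
    uint32_t carry = x.lo >= y.lo;      /* C flag: set when there is no borrow */
    r.hi = x.hi - y.hi - (1u - carry);  /* sbc: also subtracts the inverted carry */
    return r;
}
```

Round-tripping through from_u64 and to_u64 lets the result be compared against the compiler's native 64-bit arithmetic.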


Multiplication

If inputs are a:b = (a * 2^32 + b) and c:d = (c * 2^32 + d), the low 64 bits of the product will be (b * d) + (a * d + b * c) * 2^32.
That means b * d needs to return a 64-bit result, which the umull instruction does, and have a * d + b * c added to its upper 32 bits.

For example (Assuming numbers are in r1:r0 and r3:r2, and output in r5:r4):

umull r4, r5, r0, r2 // high:low = b * d
mla r5, r1, r2, r5   // high = a * d + high
mla r5, r0, r3, r5   // high = b * c + high
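
The same three steps can be written in C using only 32-bit multiplies, which makes it easy to compare against a native 64-bit multiplication. This is a sketch; the function name is hypothetical and the variable names follow the a:b and c:d notation above.

```c
#include <stdint.h>

/* Low 64 bits of x * y, computed from 32-bit halves. */
uint64_t mul64_from_halves(uint64_t x, uint64_t y)
{
    uint32_t a = (uint32_t)(x >> 32), b = (uint32_t)x;
    uint32_t c = (uint32_t)(y >> 32), d = (uint32_t)y;

    uint64_t bd = (uint64_t)b * d;       /* umull: full 64-bit product b * d */
    uint32_t lo = (uint32_t)bd;
    uint32_t hi = (uint32_t)(bd >> 32);

    hi = a * d + hi;                     /* mla: wraps modulo 2^32 */
    hi = b * c + hi;                     /* mla: wraps modulo 2^32 */

    return ((uint64_t)hi << 32) | lo;
}
```

The a * c term never appears because it is scaled by 2^64 and so falls entirely outside the low 64 bits.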

Personal Reflection

Before starting this assignment, I knew very little about ARM, and being asked to compare it to AVR made me assume the two were more similar than they turned out to be.
After finishing this assignment, I have discovered that ARM is a lot more similar to x86-64 than to AVR, which is why it is commonly described as a competitor to Intel.

For example, ARM would be complete overkill for simple mechatronics projects where interfacing with inputs and outputs at (relatively) low speeds without much data processing is all that is needed, which is why AVR is often used for these projects.

Where ARM would shine, however, is in its current main use in smartphones, where privilege separation and multitasking are important, and where the 16 MHz maximum clock of the ATmega2560 is definitely not enough.
Other possible use cases would be DSP applications (e.g., telecommunications), either by itself or as a coprocessor to an FPGA or ASIC, to handle large amounts of data at high speeds.

The lack of a built-in watchdog for ARM (arguably the only feature AVR has that ARM doesn't) likely points to the fact that ARM was designed for a more general purpose use rather than for embedded systems.


Don't use ARM if you:

  • Need very low power consumption
    • Due to all the features and speed that ARM packs in, it is unlikely to consume less power than the simpler AVR
  • Only do DSP, FP or cryptography operations infrequently
    • Having slow but infrequent DSP/FP/crypto on AVR usually greatly outweighs the cost of switching to ARM

and have no need for:

  • Heavy or large number crunching or high performance
    • AVR only supports up to 8-bit integers natively, and no SIMD
  • High processor frequencies
    • The ATmega2560 only goes up to 16 MHz, while ARM can go up to a few GHz
  • Privilege separation of running code
    • AVR has only a single CPU execution mode
  • Multitasking

Otherwise, ARM may be a possibility for your needs.


