chriseth/assembly.md

## assembly.md

      
    Raw
  

              assembly.md
            
          
    Solidity Assembly

This assembly language tries to achieve several goals:

Programs written in it should be readable, even if the code is generated by a compiler from Solidity.
The translation from assembly to bytecode should contain as few “surprises” as possible.
Control flow should be easy to detect to help in formal verification and optimization.

In order to achieve the first and last goal, assembly provides high-level constructs like for loops, switch statements and function calls. It should be possible to write assembly programs that do not make use of explicit SWAP, DUP, JUMP and JUMPI statements, because the first two obfuscate the data flow and the last two obfuscate control flow. Furthermore, functional statements of the form ‘mul(add(x, y), 7)’ are preferred over pure opcode statements like ‘7 x y add mul’ because in the first form, it is much easier to see which operand is used for which opcode.
The second goal is achieved by introducing a desugaring phase that only removes the higher level constructs in a very regular way and still allows inspecting the generated low-level assembly code. The only non-local operation performed by the assembler is name lookup of user-defined identifiers (functions, variables, …), which follow very simple and regular scoping rules and cleanup of local variables from the stack.
Scoping: An identifier that is declared (label, variable, function, assembly) is only visible in the block where it was declared (including nested blocks inside the current block). It is not legal to access local variables across function borders, even if they would be in scope. Shadowing is allowed, but two identifiers with the same name cannot be declared in the same block. Local variables cannot be accessed before they were declared, but labels, functions and assemblies can. Assemblies are special blocks that are used for e.g. returning runtime code or creating contracts. No identifier from an outer assembly is visible in a sub-assembly.
If control flow passes over the end of a block, pop instructions are inserted that match the number of local variables declared in that block, unless the } is directly preceded by an opcode that does not have a continuing control flow path. The stack height is reduced by the number of local variables regardless of that. This mean that labels in the next block will have the same height as before the block that just ended.
If at the end of a block, the stack is not balanced, a warning is issued, unless the last instruction in the block did not have a continuing control flow path.
Why do we use higher-level constructs like switch, for and functions:
Using switch, for and functions, it should be possible to write complex code without using jump or jumpi manually. This makes it much easier to analyze the control flow, which allows for improved formal verification and optimization.
Furthermore, if manual jumps are allowed, computing the stack height is rather complicated.
The position of all local variables on the stack needs to be known, otherwise neither references to local variables nor removing local variables automatically from the stack at the end of a block will work properly. Because of that, every label that is preceded by an instruction that ends or diverts control flow, should be annotated with the current stack layout. This annotation is performed automatically during the desugaring phase.
Example:

We will follow an example compilation from Solidity to desugared assembly.
We consider the runtime bytecode of the following Solidity program:
contract C {
  function f(uint x) returns (uint y) {
    y = 1
    for (uint i = 0; i < x; i++)
      y = 2 * y;
  }
}
The following assembly will be generated:
{
  mstore(0x40, 0x60) // store the “free memory pointer”
  // function dispatcher
  switch div(calldataload(0), exp(2, 226))
    case 0xb3de648b: {
      let (r,) = f(calldataload(4))
      let ret := $allocate(0x20)
      mstore(ret, r)
      return(ret, 0x20)
    }
    default: { jump(invalidJumpLabel) }
  // memory allocator
  function $allocate(size) -> (pos) {
    pos := mload(0x40)
    mstore(0x40, add(pos, size))
  }
  // the contract function
  function f(x) -> (y) {
    y := 1
    for { let i := 0 } lt(i, x) { i := add(i, 1) } {
      y := mul(2, y)
    }
  }
}
After the desugaring phase:
{
  mstore(0x40, 0x60)
  {
    let $0 := div(calldataload(0), exp(2, 226))
    jumpi($case1, eq($0, 0xb3de648b))
    jump($caseDefault)
    $case1:
    {
      // the function call - we put return label and arguments on the stack
      $ret1 calldataload(4) jump($fun_f)
      $ret1 [r]: // a label with a [...]-annotation resets the stack height
                 // to “current block + number of local variables”. It also
                 // introduces a variable, r:
                 // r is at top of stack, $0 is below (from enclosing block)
      $ret2 0x20 jump($fun_allocate)
      $ret2 [ret]: // stack here: $0, r, ret (top)
      mstore(ret, r)
      return(ret, 0x20)
      // although it is useless, the jump is automatically inserted,
      // since the desugaring process does not analyze control-flow
      jump($endswitch)
    }
    $caseDefault:
    {
      jump(invalidJumpLabel)
      jump($endswitch)
    }
    $endswitch:
  }
  jump($afterFunction)
  $fun_allocate:
  {
    $start[$retpos, size]:
    let pos := 0
    {
      pos := mload(0x40)
      mstore(0x40, add(pos, size))
    }
    swap1 pop swap1 jump
  }
  $fun_f:
  {
    start [$retpos, x]:
    let y := 0
    {
      let i := 0
      $for_begin:
      jumpi($for_end, iszero(lt(i, x)))
      {
        y := mul(2, y)
      }
      $for_continue:
      { i := add(i, 1) }
      jump($for_begin)
      $for_end:
    } // Here, a pop instruction is inserted for i
    swap1 pop swap1 jump
  }
  $afterFunction:
  stop
}
Assembly happens in four stages:


Parsing
Desugaring (removes switch, for and functions)
Opcode stream generation
Bytecode generation

Parsing:


Turn the byte stream into a token stream, discarding C++-style comments (a special comment exists for source references, but we will not explain it here).
Turn the token stream into an AST according to the grammar below
Register identifiers with the block they are defined in (annotation to the AST node) and note from which point on, variables can be accessed.

Grammar:
AssemblyBlock = '{' AssemblyItem* '}'


AssemblyItem =
    Identifier |
    FunctionalAssemblyExpression |
    AssemblyBlock |
    AssemblyLocalDefinition |
    FunctionalAssemblyAssignment |
    AssemblyAssignment |
    AssemblySwitch |
    AssemblyFunctionDefinition |
    AssemblyFor |
    ‘break’ | ‘continue’ |
    SubAssembly | ‘dataSize’ ‘(‘ Identifier ‘)’ |
    LinkerSymbol |
    ‘programSize’ | ‘errorLabel’ | ‘bytecodeSize’ | ‘invalidJumpLabel’ |
    NumberLiteral | StringLiteral | HexLiteral
AssemblyLocalDefinition = 'let' IdentifierOrList ':=' FunctionalAssemblyExpression
FunctionalAssemblyAssignment = IdentifierOrList ':=' FunctionalAssemblyExpression
IdentifierOrList = Identifier | ‘(‘ IdentifierList ‘)’
AssemblyAssignment = '=:' Identifier
LabelDefinition = Identifier ( ‘[‘ ( IdentifierList | NumberLiteral ) ‘]’ )? ‘:’
FunctionalAssemblyExpression = Identifier '(' ( AssemblyItem ( ',' AssemblyItem )* )? ')'
AssemblyFunctionDefinition = ‘function’ Identifier ‘(‘ IdentifierList? ‘)’ ‘->’
    ( ‘(‘ IdentifierList ‘)’ AssemblyBlock
IdentifierList = Identifier ( ‘,’ Identifier)*
AssemblySwitch = ‘switch’ FunctionalAssemblyExpression AssemblyCase*
    ( ‘default’ ‘:’ AssemblyBlock )?
AssemblyCase = ‘case’ FunctionalAssemblyExpression ‘:’ AssemblyBlock
AssemblyFor = ‘for’ ( AssemblyBlock | FunctionalAssemblyExpression)
    FunctionalAssemblyExpression ( AssemblyBlock | FunctionalAssemblyExpression) AssemblyBlock
SubAssembly = ‘assembly’ Identifier AssemblyBlock
LinkerSymbol = ‘linkerSymbol’ ‘(‘ StringLiteral ‘)’
Desugaring

An AST transformation removes for, switch and function constructs. The result is still parseable by the same parser, but it will not use certain constructs. If jumpdests are added that are only jumped to and not continued at, information about the stack content is added, unless no local variables of outer scopes are accessed or the stack height is the same as for the previous instruction.
desugar item: AST -> AST =
match item {
AssemblyFunctionDefinition(‘function’ name ‘(‘ arg1, ..., argn ‘)’ ‘->’ ( ‘(‘ ret1, ..., retm ‘)’ body) ->
  <name>:
  {
    $<name>_start [$retPC, $argn, ..., arg1]:
    let ret1 := 0 ... let retm := 0
    { desugar(body) }
    swap and pop items so that only ret1, ... retn, $retPC are left on the stack
    jump 
  }
AssemblyFor(‘for’ { init } condition post body) ->
  {
    init // cannot be its own block because we want variable scope to extend into the body
    // find I such that there are no labels $forI_*
    $forI_begin:
    jumpi($forI_end, iszero(condition))
    { body }
    $forI_continue:
    { post }
    jump($forI_begin)
    $forI_end:
  }
‘break’ ->
  {
    // find nearest enclosing scope with label $forI_end
    pop all local variables that are defined at the current point
    but not at $forI_end
    jump($forI_end)
  }
‘continue’ ->
  {
    // find nearest enclosing scope with label $forI_continue
    pop all local variables that are defined at the current point
    but not at $forI_continue
    jump($forI_continue)
  }
AssemblySwitch(switch condition cases ( default: defaultBlock )? ) ->
  {
    // find I such that there is no $switchI* label or variable
    let $switchI_value := condition
    for each of cases match {
      case val: -> jumpi($switchI_caseJ, eq($switchI_value, val))
    }
    if default block present: ->
      { defaultBlock jump($switchI_end) }
    for each of cases match {
      case val: { body } -> $switchI_caseJ: { body jump($switchI_end) }
    }
    $switchI_end:
  }
FunctionalAssemblyExpression( identifier(arg1, arg2, ..., argn) ) ->
  {
    if identifier is function <name> with n args and m ret values ->
      {
        // find I such that $funcallI_* does not exist
        $funcallI_return argn  ... arg2 arg1 jump(<name>)
        if the current context is `let (id1, ..., idm) := f(...)` ->
          $funcallI_return [id1, ..., idm]:
        else ->
          $funcallI_return[m - n - 1]:
          turn the functional expression that leads to the function call
          into a statement stream
      }
    else -> desugar(children of node)
  }
default node ->
  desugar(children of node)
}
Opcode stream generation:

During opcode stream generation, we keep track of the current stack height, so that accessing stack variables by name is possible.
codegen item: AST -> opcode_stream =
match item {
AssemblyBlock({ items }) ->
  join(codegen(item) for item in items)
  if last generated opcode has continuing control flow:
    POP for all local variables registered at the block (including variables
    introduced by labels)
    warn if the stack height at this point is not the same as at the start of the block
Identifier(id) ->
  lookup id in the syntactic stack of blocks
  match type of id
    Local Variable ->
      DUPi where i = 1 + stack_height - stack_height_of_identifier(id)
    Label ->
      // reference to be resolved during bytecode generation
      PUSH<bytecode position of label>
    SubAssembly ->
      PUSH<bytecode position of subassembly data>
FunctionalAssemblyExpression(id ( arguments ) ) ->
  join(codegen(arg) for arg in arguments.reversed())
  id (which has to be an opcode, might be a function name later)
AssemblyLocalDefinition(let (id1, ..., idn) := expr) ->
  register identifiers id1, ..., idn as locals in current block at current stack height
  codegen(expr) - assert that expr returns n items to the stack
FunctionalAssemblyAssignment((id1, ..., idn) := expr) ->
  lookup id1, ..., idn in the syntactic stack of blocks, assert that they are variables
  codegen(expr)
  for j = n, ..., i:
  SWAPi where i = 1 + stack_height - stack_height_of_identifier(idj)
  POP
AssemblyAssignment(=: id) ->
  look up id in the syntactic stack of blocks, assert that it is a variable
  SWAPi where i = 1 + stack_height - stack_height_of_identifier(id)
  POP
LabelDefinition(name [id1, ..., idn] :) ->
  JUMPDEST
  // register new variables id1, ..., idn and set the stack height to
  // stack_height_at_block_start + number_of_local_variables
LabelDefinition(name [number] :) ->
  JUMPDEST
  // adjust stack height by +number (can be negative)
NumberLiteral(num) ->
  PUSH<num interpreted as decimal and right-aligned>
HexLiteral(lit) ->
  PUSH32<lit interpreted as hex and left-aligned>
StringLiteral(lit) ->
  PUSH32<lit utf-8 encoded and left-aligned>
SubAssembly(assembly <name> block) ->
  append codegen(block) at the end of the code
dataSize(<name>) ->
  assert that <name> is a subassembly ->
  PUSH32<size of code generated from subassembly <name>>
linkerSymbol(<lit>) ->
  PUSH32<zeros> and append position to linker table
} 

Later extensions: Types for all variables (e.g. 64 bit words or even functions)
Constants / syntactic macros