@hayesgm
Last active April 17, 2024 11:25

Overview

To understand how constructors work in Solidity (that is, how we go from the compiled contract's bin to a live deployed contract with a different deployedBytecode), I took a deep dive into how one simple contract works.

The contract Simple.sol:

pragma solidity ^0.5.12;

contract Simple {
  function one() public returns (uint) {
    return 1;
  }
}

Creating Contracts and the Ethereum Virtual Machine

To start, a quick primer on how the Ethereum VM works. When you execute code in a smart contract, the contract's deployed bytecode is loaded into a VM (with state based on other values, such as the current block number or contract storage). That deployed bytecode is interpreted as assembly opcodes and, if all goes well, performs the actions you want based on the input CALLDATA. At the end, if you want to, say, return a value to the caller, you can load state into memory and then call the RETURN instruction.

But how does that deployed bytecode get there in the first place? You might expect that deploying a contract is simply saying "I want a new contract and here is the bytecode." That's close but not actually how it works. When you deploy a contract, you actually give it bytecode, which is run in the VM, and that bytecode is expected to RETURN the new bytecode that should be stored for the contract. It's a little complex, but the goal is that the code in the VM can 1) run initialization, such as setting storage values to initial states, and 2) rewrite the byte-code, e.g. to store immutable values. Throughout this discussion, I'll use the Solidity terms bin (or constructor) and deployedBytecode to refer to the deployment code and its result, respectively.

Contract Compiled Bytecode

So, back to our simpleton contract that doesn't do anything in its constructor. What does its "bin" look like? Here is a full analysis of the code for that constructor. I'll dig into the sections a bit more below, but take a second and breathe it in.

00: PUSH1 0x80 -- !! START OF CONSTRUCTOR CODE !!
02: PUSH1 0x40
04: MSTORE        -- m[0x40] = 0x80
05: CALLVALUE     -- [ Revert if Eth sent with transaction
06: DUP1          -- s = [CALLVALUE]
07: ISZERO
08: PUSH1 0x0F
0A: JUMPI
0B: PUSH1 0x00
0D: DUP1
0E: REVERT
0F: JUMPDEST      -- ]
10: POP           -- s = []
11: PUSH1 0x7F    -- s = [0: 0x7F]
13: DUP1          -- s = [0: 0x7F, 1: 0x7F]
14: PUSH2 0x001E  -- s = [0: 0x001E, 1: 0x7F, 2: 0x7F]
17: PUSH1 0x00    -- s = [0: 0x00, 1: 0x001E, 2: 0x7F, 3: 0x7F]
19: CODECOPY      -- M(0x00:0x7F) = I(0x1E:0x9D) Note: s[0] = mem offset, s[1] = code offset, s[2] = code len
1A: PUSH1 0x00    -- s = [0x00, 0x7F]
1C: RETURN        -- RETURN
1D: INVALID   -- !! MARKER FOR START OF DEPLOYED BYTE-CODE BELOW !!
1E: PUSH1 0x80    -- s = [0x80]
20: PUSH1 0x40    -- s = [0x40, 0x80]
22: MSTORE        -- m[0x40] = 0x80
23: CALLVALUE     -- s = [CALLVALUE]
24: DUP1          -- s = [CALLVALUE, CALLVALUE]
25: ISZERO        
26: PUSH1 0x0F
28: JUMPI         -- JD = 0x2D
29: PUSH1 0x00
2B: DUP1
2C: REVERT
2D: JUMPDEST
2E: POP           -- s = []
2F: PUSH1 0x04    -- s = [0x04] (from ContractCompiler.cpp:390)
31: CALLDATASIZE  -- s = [CALLDATASIZE, 0x04]
32: LT            
33: PUSH1 0x28
35: JUMPI         -- if CALLDATASIZE < 4, JUMP: 0x46
36: PUSH1 0x00    -- s = [0x00]
38: CALLDATALOAD  -- s += I[0x00] 
39: PUSH1 0xE0    -- s = [0xE0, I[0x00]]
3B: SHR           -- s = [I[0x00]>>0xE0]
3C: DUP1          -- s = [I[0x00]>>0xE0, I[0x00]>>0xE0]
3D: PUSH4 0x901717D1 -- s = [0x901717D1, I[0x00]>>0xE0, I[0x00]>>0xE0]
42: EQ            -- s = [0x901717D1 == I[0x00]>>0xE0, I[0x00]>>0xE0]
43: PUSH1 0x2D
45: JUMPI         -- if CALLDATA[0,4] == 0x901717D1, JUMP: 0x4B
46: JUMPDEST      -- fallback (simple revert from ContractCompiler.cpp:409)
47: PUSH1 0x00
49: DUP1
4A: REVERT
4B: JUMPDEST      -- one()
4C: PUSH1 0x33    -- s = [0x33, I[0x00]>>0xE0]
4E: PUSH1 0x45    -- s = [0x45, 0x33, I[0x00]>>0xE0]
50: JUMP          -- JUMP: 0x63
51: JUMPDEST      -- Jump from `one()`, stack = [0x01, I[0x00]>>0xE0]
52: PUSH1 0x40    -- s = [0x40, 0x01, I[0x00]>>0xE0]
54: DUP1          -- s = [0x40, 0x40, 0x01, I[0x00]>>0xE0]
55: MLOAD         -- s = [M[0x40], 0x40, 0x01, I[0x00]>>0xE0] = [0x80, 0x40, 0x01, I[0x00]>>0xE0]
56: SWAP2         -- s = [0x01, 0x40, 0x80, I[0x00]>>0xE0] (swaps the top with the third item)
57: DUP3          -- s = [0x80, 0x01, 0x40, 0x80, I[0x00]>>0xE0]
58: MSTORE        -- M[0x80] = 0x01 (the return value goes to the free memory pointer), s = [0x40, 0x80, I[0x00]>>0xE0]
59: MLOAD         -- s = [M[0x40], 0x80, I[0x00]>>0xE0] = [0x80, 0x80, I[0x00]>>0xE0]
5A: SWAP1         -- s = [0x80, 0x80, I[0x00]>>0xE0]
5B: DUP2          -- s = [0x80, 0x80, 0x80, I[0x00]>>0xE0]
5C: SWAP1         -- s = [0x80, 0x80, 0x80, I[0x00]>>0xE0]
5D: SUB           -- s = [0x80 - 0x80, 0x80, I[0x00]>>0xE0] = [0x00, 0x80, I[0x00]>>0xE0]
5E: PUSH1 0x20    -- s = [0x20, 0x00, 0x80, I[0x00]>>0xE0]
60: ADD           -- s = [0x20, 0x80, I[0x00]>>0xE0]
61: SWAP1         -- s = [0x80, 0x20, I[0x00]>>0xE0]
62: RETURN        -- returns M[0x80:0xA0], the 32-byte word holding 0x01 (s[0] = mem offset, s[1] = length)
63: JUMPDEST      -- Jump from 0x50 for `one()` (Stack: [0x33, I[0x00]>>0xE0])
64: PUSH1 0x01    -- s = [0x01, 0x33, I[0x00]>>0xE0]
66: SWAP1         -- s = [0x33, 0x01, I[0x00]>>0xE0]
67: JUMP          -- JUMP: 0x51
68: INVALID  -- !! MARKER FOR CBOR-ENCODED METADATA !!
69: LOG2
6A: PUSH6 0x627A7A723158
71: SHA3
72: INVALID
73: INVALID
74: SWAP11
75: INVALID
76: PUSH3 0x8CAC64
7A: INVALID
7B: INVALID
7C: DIFFICULTY
7D: PUSH29 0xFA9AED13DE3B93E7D35DE9EC8CD56E2489B8BBA364736F6C6343000510
9B: STOP
9C: ORIGIN

So, the constructor bytecode is composed of three sections. The first section is what runs in the VM when you deploy the contract. Its goal is to copy the deployed bytecode into memory and return it. The second section is the code that will become the deployed bytecode. The third section is CBOR-encoded metadata (including the Swarm hash and Solidity version).

Constructor Code

Diving right into the constructor code:

-- Snip
11: PUSH1 0x7F    -- s = [0: 0x7F]
13: DUP1          -- s = [0: 0x7F, 1: 0x7F]
14: PUSH2 0x001E  -- s = [0: 0x001E, 1: 0x7F, 2: 0x7F]
17: PUSH1 0x00    -- s = [0: 0x00, 1: 0x001E, 2: 0x7F, 3: 0x7F]
19: CODECOPY      -- M(0x00:0x7F) = I(0x1E:0x9D) Note: s[0] = mem offset, s[1] = code offset, s[2] = code len
1A: PUSH1 0x00    -- s = [0x00, 0x7F]
1C: RETURN        -- RETURN

0x7F is the size in bytes of the runtime code, including metadata. 0x1E is the offset of the first instruction of the deployed bytecode within this bytecode itself. The code above effectively calls CODECOPY, telling it to copy 0x7F bytes of this bytecode, starting at 0x1E (that is, the range [0x1E, 0x1E+0x7F)), to memory location 0. Then we call RETURN, telling it the return value is the 0x7F bytes of memory starting at 0x00, which is the runtime bin we just copied into memory. That's it; that's the whole constructor. (Note: we snipped a few lines of code that revert if any Eth is sent, as this constructor isn't payable.)
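As a rough illustration of that copy-and-return step, here is a minimal Python sketch. The constructor bytes are reassembled by hand from the listing above, the runtime and metadata bytes come from the second deployment hex shown later in this post, and `emulate_constructor` is a hypothetical helper name, not anything from Solidity or a real EVM library.

```python
# Constructor ops, reassembled from the annotated listing (offsets 0x00..0x1D).
CONSTRUCTOR = bytes.fromhex(
    "6080604052348015600f57600080fd5b50"  # memory setup + non-payable check
    "607f80"                              # PUSH1 0x7F, DUP1  (length)
    "61001e"                              # PUSH2 0x001E      (code offset)
    "600039"                              # PUSH1 0x00, CODECOPY
    "6000f3"                              # PUSH1 0x00, RETURN
    "fe"                                  # INVALID marker
)
RUNTIME = bytes.fromhex(
    "6080604052348015600f57600080fd5b50"
    "6004361060285760003560e01c8063901717d114602d57"
    "5b600080fd"
    "5b6033604556"
    "5b60408051918252519081900360200190f3"
    "5b60019056"
    "fe"
)
METADATA = bytes.fromhex(
    "a265627a7a72315820"
    "8a8f3c65a125bc24e00953797e30744d7f79b3f85510aed0f969030bb4c88468"
    "64736f6c63430005100032"
)
creation = CONSTRUCTOR + RUNTIME + METADATA


def emulate_constructor(code: bytes, code_offset: int, length: int) -> bytes:
    """Mimic CODECOPY(0, code_offset, length) followed by RETURN(0, length)."""
    memory = bytearray(length)
    memory[0:length] = code[code_offset:code_offset + length]  # CODECOPY
    return bytes(memory[0:length])                             # RETURN


deployed = emulate_constructor(creation, 0x1E, 0x7F)
print(hex(len(creation)), hex(len(deployed)))
```

The two constants passed in are exactly the `PUSH2 0x001E` and `PUSH1 0x7F` operands from the constructor, so the slice is the same region the VM would hand back as the contract's code.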

The next instruction is interesting:

1D: INVALID   -- !! MARKER FOR START OF DEPLOYED BYTE-CODE BELOW !!

Solidity marks the different segments with an invalid op code simply to make sure code doesn't accidentally execute between the sections. We will utilize this behavior later to quickly separate the different segments.

Runtime Code

Next, we'll take a quick look at the runtime code. As this is likely discussed in other posts, I'll just take a few minor highlights:

-- Snip
                  -- [ This segment gets the four highest-order bytes from the input data
                  --   and compares those to the sha3 hash of each function signature. For
                  --   any match, it jumps to the function definition.

38: CALLDATALOAD      -- s += I[0x00]
39: PUSH1 0xE0        -- s = [0xE0, I[0x00]]
3B: SHR               -- s = [I[0x00]>>0xE0]
3C: DUP1              -- s = [I[0x00]>>0xE0, I[0x00]>>0xE0]
3D: PUSH4 0x901717D1  -- s = [0x901717D1, I[0x00]>>0xE0, I[0x00]>>0xE0]
42: EQ                -- s = [0x901717D1 == I[0x00]>>0xE0, I[0x00]>>0xE0]
43: PUSH1 0x2D
45: JUMPI             -- if CALLDATA[0,4] == 0x901717D1, JUMP: 0x4B
46: JUMPDEST          -- fallback (simple revert from ContractCompiler.cpp:409)
47: PUSH1 0x00
49: DUP1
4A: REVERT        -- ]
4B: JUMPDEST      -- [ Start of function `one()` (note: 0x901717D1 is the first four bytes of the
                  --   sha3 value of the string 'one()'). This is how Solidity determines which function
                  --   you intended to call.
                  --
                  --   Note: This function is effectively written backwards; it first pushes two jump locations
                  --         onto the stack, then it jumps to the first at `0x63` (which pushes the return value 0x01 to the
                  --         stack), and then it jumps to the second (at `0x51`), which moves that stack value
                  --         into memory so it can be `RETURN`ed. Note: the addresses `0x51` and `0x63` are shown here
                  --         relative to the original bytecode. Once deployed, the offsets are relative to the start
                  --         of the deployed bytecode, which you see in the constants `0x33` and `0x45` below.
                  --         
4C: PUSH1 0x33    -- s = [0x33, I[0x00]>>0xE0]
4E: PUSH1 0x45    -- s = [0x45, 0x33, I[0x00]>>0xE0]
50: JUMP          -- JUMP: 0x63
51: JUMPDEST      -- Jump back from `one()`, stack = [0x01, I[0x00]>>0xE0]
                  -- Snip moving the stack value into memory for `RETURN`
62: RETURN
63: JUMPDEST      -- Jump from 0x50 for `one()` (Stack: [0x33, I[0x00]>>0xE0])
64: PUSH1 0x01    -- s = [0x01, 0x33, I[0x00]>>0xE0]
66: SWAP1         -- s = [0x33, 0x01, I[0x00]>>0xE0]
67: JUMP          -- JUMP: 0x51

This annotated bytecode shows how Solidity figures out what function is being called, jumps to that function, and then returns the data from that function. Again, this code should be covered in other docs fairly well, and there's nothing particularly special going on.
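To sanity-check the annotations, here is a toy interpreter in Python that handles only the opcodes Simple uses. It is a sketch under loud assumptions: no gas, no storage, a fixed 1 KB memory, no jump-destination validation, and REVERT/INVALID simply raise. The function name `run` and the reassembled constructor bytes are mine, not part of any real EVM library.

```python
MASK = (1 << 256) - 1

CREATION = bytes.fromhex(
    "6080604052348015600f57600080fd5b50607f8061001e6000396000f3fe"  # constructor
    "6080604052348015600f57600080fd5b50"                            # runtime
    "6004361060285760003560e01c8063901717d114602d57"
    "5b600080fd5b6033604556"
    "5b60408051918252519081900360200190f3"
    "5b60019056fe"
    "a265627a7a72315820"                                            # metadata
    "8a8f3c65a125bc24e00953797e30744d7f79b3f85510aed0f969030bb4c88468"
    "64736f6c63430005100032"
)


def run(code: bytes, calldata: bytes = b"", value: int = 0) -> bytes:
    """Execute `code` and return the RETURN data; raises on REVERT/INVALID."""
    stack, mem, pc = [], bytearray(1024), 0
    pop = stack.pop
    while pc < len(code):
        op = code[pc]
        pc += 1
        if 0x60 <= op <= 0x7F:                      # PUSH1..PUSH32
            n = op - 0x5F
            stack.append(int.from_bytes(code[pc:pc + n], "big"))
            pc += n
        elif 0x80 <= op <= 0x8F:                    # DUP1..DUP16
            stack.append(stack[-(op - 0x7F)])
        elif 0x90 <= op <= 0x9F:                    # SWAP1..SWAP16
            i = -(op - 0x8F) - 1
            stack[-1], stack[i] = stack[i], stack[-1]
        elif op == 0x01:                            # ADD
            stack.append((pop() + pop()) & MASK)
        elif op == 0x03:                            # SUB: top - second
            a, b = pop(), pop()
            stack.append((a - b) & MASK)
        elif op == 0x10:                            # LT: top < second
            a, b = pop(), pop()
            stack.append(int(a < b))
        elif op == 0x14:                            # EQ
            stack.append(int(pop() == pop()))
        elif op == 0x15:                            # ISZERO
            stack.append(int(pop() == 0))
        elif op == 0x1C:                            # SHR: second >> top
            shift, val = pop(), pop()
            stack.append(val >> shift)
        elif op == 0x34:                            # CALLVALUE
            stack.append(value)
        elif op == 0x35:                            # CALLDATALOAD (zero-padded)
            off = pop()
            word = calldata[off:off + 32].ljust(32, b"\0")
            stack.append(int.from_bytes(word, "big"))
        elif op == 0x36:                            # CALLDATASIZE
            stack.append(len(calldata))
        elif op == 0x39:                            # CODECOPY(mem, code, len)
            dst, src, n = pop(), pop(), pop()
            mem[dst:dst + n] = code[src:src + n].ljust(n, b"\0")
        elif op == 0x50:                            # POP
            pop()
        elif op == 0x51:                            # MLOAD
            off = pop()
            stack.append(int.from_bytes(mem[off:off + 32], "big"))
        elif op == 0x52:                            # MSTORE(offset, value)
            off, val = pop(), pop()
            mem[off:off + 32] = val.to_bytes(32, "big")
        elif op == 0x56:                            # JUMP
            pc = pop()
        elif op == 0x57:                            # JUMPI(dest, cond)
            dest, cond = pop(), pop()
            if cond:
                pc = dest
        elif op == 0x5B:                            # JUMPDEST is a no-op
            pass
        elif op == 0xF3:                            # RETURN(offset, len)
            off, n = pop(), pop()
            return bytes(mem[off:off + n])
        else:                                       # REVERT, INVALID, anything else
            raise RuntimeError(f"halted on opcode 0x{op:02x} at 0x{pc - 1:02x}")
    return b""


deployed = run(CREATION)                            # "deploy": run the constructor
result = run(deployed, calldata=bytes.fromhex("901717d1"))  # call one()
print(len(deployed), int.from_bytes(result, "big"))
```

Running the creation code gives back the runtime and metadata bytes, and running those with the `one()` selector as calldata follows the dispatch, the two jumps, and the memory shuffle before `RETURN`.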

Right after this code, we again see:

68: INVALID  -- !! MARKER FOR CBOR-ENCODED METADATA !!

This invalid marks the end of the deployed bytecode (and ensures the program counter doesn't overrun the end of the code).

Metadata

Finally, there's a section that shows a gibberish of instructions at the end, starting with:

69: LOG2
6A: PUSH6 0x627A7A723158
71: SHA3

The reason is, this isn't EVM bytecode at all. It's CBOR-encoded metadata providing some information about the contract that Solidity decided to store on-chain. [0] [1]

When we decode it, we get:

{
  bzzr1: '0x8a8f3c65a125bc24e00953797e30744d7f79b3f85510aed0f969030bb4c88468',
  solc: '0x000510'
}

These are the Swarm hash of the contract's metadata file and the Solidity version ([0x00, 0x05, 0x10] = 0.5.16).
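For the curious, the decoding can be sketched in a few lines of Python without a CBOR library. This is a deliberately tiny reader that only understands the three item types that appear here (maps, text strings, and byte strings with short lengths); `read_item` and `decode_metadata` are hypothetical helper names. It relies on the last two bytes of the bytecode (0x0032 here) giving the length of the CBOR blob, and it reads the three solc bytes as major.minor.patch.

```python
def read_item(buf: bytes, pos: int):
    """Decode one CBOR item at `pos`; returns (value, next_pos)."""
    major, info = buf[pos] >> 5, buf[pos] & 0x1F
    pos += 1
    if info < 24:                       # length stored in the header byte
        length = info
    elif info == 24:                    # length stored in the next byte
        length, pos = buf[pos], pos + 1
    else:
        raise ValueError("length encoding not handled in this sketch")
    if major == 2:                      # byte string
        return buf[pos:pos + length], pos + length
    if major == 3:                      # text string
        return buf[pos:pos + length].decode("utf-8"), pos + length
    if major == 5:                      # map with `length` key/value pairs
        out = {}
        for _ in range(length):
            key, pos = read_item(buf, pos)
            out[key], pos = read_item(buf, pos)
        return out, pos
    raise ValueError(f"CBOR major type {major} not handled in this sketch")


def decode_metadata(deployed: bytes) -> dict:
    """Use the trailing 2-byte length to locate and decode the CBOR blob."""
    cbor_len = int.from_bytes(deployed[-2:], "big")
    cbor = deployed[-2 - cbor_len:-2]
    meta, _ = read_item(cbor, 0)
    return meta


# Tail of the deployed bytecode (everything after the INVALID at 0x68).
tail = bytes.fromhex(
    "a265627a7a72315820"
    "8a8f3c65a125bc24e00953797e30744d7f79b3f85510aed0f969030bb4c88468"
    "64736f6c63430005100032"
)
meta = decode_metadata(tail)
version = ".".join(str(b) for b in meta["solc"])
print("bzzr1:", "0x" + meta["bzzr1"].hex())
print("solc: ", version)
```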

The bzzr hash will change if any single character (even a newline) changes in the Solidity source code. We'll come back to this in a moment.

Etherscan Verification

How does Etherscan verification work? Why do you need to pass in your source code files, your Solidity version, and whether or not you performed optimizations? Let's start with the task. Etherscan wants to make sure it displays the correct source code for the contracts that were compiled into the deployed bytecode it sees on chain. As Etherscan tracks every transaction, it knows the constructor bytecode and call data used to create the contract. It simply needs to know that the source code and constructor args you put in match what it saw for the contract creation. That is, Etherscan checks to see if the constructor bytecode it gets from compiling your contracts matches the constructor bytecode from the contract creation transaction. It also checks to see if the constructor args you pass in match the input data it saw for the transaction.

But this doesn't always work. Remember how Solidity includes the metadata hash in the constructor (and runtime) bytecode? If we simply did this comparison, we'd see (hard to spot) differences like the ones below:

- 0x6080604052348015600f57600080fd5b506004361060285760003560e01c8063901717d114602d575b600080fd5b60336045565b60408051918252519081900360200190f35b60019056fea265627a7a72315820496d028fc29efe2e00b5e738360c8efc4d04fecf7ec88f97503e0450e32461dd64736f6c63430005100032
+ 0x6080604052348015600f57600080fd5b506004361060285760003560e01c8063901717d114602d575b600080fd5b60336045565b60408051918252519081900360200190f35b60019056fea265627a7a723158208a8f3c65a125bc24e00953797e30744d7f79b3f85510aed0f969030bb4c8846864736f6c63430005100032

The two deployments have nearly the same code, except for a few different bytes near the end. Ah yes, it's the Swarm hash. We must have had slightly different source code, which is causing the two deployment bytecodes to mismatch. To handle this issue, Etherscan actually removes the Swarm (bzzr) hash before it does a comparison. We can see this here:

[Image: Etherscan]

The above screenshot from Etherscan shows a contract that failed to verify. (I modified the one() function above to return 2; you can see the bytecode at location 0x64 change from 0x6001 (PUSH1 0x01) to 0x6002 (PUSH1 0x02).) But notice that in both code snippets, the Swarm hash is replaced by the string {bzzr}, indicating that Etherscan is not comparing those in its check. It declares a match for anything that matches on everything except the Swarm hash.
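A comparison in that spirit might look like the following Python sketch. To be clear, this is not Etherscan's code; it only imitates the `{bzzr}` masking visible in the screenshot, and `mask_swarm_hash` and `same_contract` are hypothetical names. The two hashes are the ones from the diff above.

```python
import re

RUNTIME = (
    "6080604052348015600f57600080fd5b50"
    "6004361060285760003560e01c8063901717d114602d57"
    "5b600080fd5b6033604556"
    "5b60408051918252519081900360200190f3"
    "5b60019056fe"
)
META_HEAD = "a265627a7a72315820"      # CBOR map header, "bzzr1" key, 32-byte string tag
META_TAIL = "64736f6c63430005100032"  # "solc" key, version bytes, 2-byte CBOR length


def deployed_hex(swarm_hash: str, runtime: str = RUNTIME) -> str:
    """Assemble deployed bytecode hex for a given Swarm hash."""
    return runtime + META_HEAD + swarm_hash + META_TAIL


first = deployed_hex("496d028fc29efe2e00b5e738360c8efc4d04fecf7ec88f97503e0450e32461dd")
second = deployed_hex("8a8f3c65a125bc24e00953797e30744d7f79b3f85510aed0f969030bb4c88468")

# The 64 hex digits that follow the "bzzr1" key and its byte-string tag.
_BZZR = re.compile("(" + META_HEAD + ")[0-9a-f]{64}")


def mask_swarm_hash(code_hex: str) -> str:
    """Replace the Swarm hash with a {bzzr} placeholder before comparing."""
    code_hex = code_hex.lower()
    if code_hex.startswith("0x"):
        code_hex = code_hex[2:]
    return _BZZR.sub(r"\g<1>{bzzr}", code_hex)


def same_contract(a: str, b: str) -> bool:
    return mask_swarm_hash(a) == mask_swarm_hash(b)


print(first == second, same_contract(first, second))
```

A byte-for-byte comparison of the two deployments fails, while the masked comparison succeeds; a real change to the runtime code (such as returning 2 instead of 1) still fails either way.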

Compound Protocol Verification

Lastly, I want to talk about Compound Protocol verification. When a user from the community deploys a contract, we want people to be able to verify if that contract comes from code that has been audited. To do this, we can quickly compare if that code was generated from a release commit of the Compound Protocol repo. We will use a technique similar to the Etherscan verification system above, but we have one advantage: we can catalogue all possible deployments and quickly detect the parameters for a deployment. That is, you don't need to tell the system what version of Solidity or Compound you used; we can figure it out through static code analysis.

On the other side, we don't have every transaction on Ethereum already tracked, and the blockchain does not store the constructor bytecode after it runs the contract creation. So we'll need to work a little harder to detect what constructor bytecode could have generated the given deployed bytecode. But since we now know how Solidity constructors work, we should be able to do this for contracts that play by the rules.

First, we'll compile the source code ourselves in Docker (which will give us the constructor bytecode). The goal of this task is to keep the environment as similar as possible to the time it was compiled by a user, so that we don't accidentally change the bytecode. But once we have the constructor bytecode, we're stuck again, because the blockchain only contains the deployed bytecode, not the constructor bytecode.

Well, remember how the constructor bytecode works? It RETURNs the bytecode that should be stored on-chain for the given contract. That means if we eth_call to the chain with the constructor bytecode through Web3, the return value of that call will be the code that would have been stored on-chain. Great, so now we run the constructor (we can even mock from here to match the original call) and we check to see if the bytecode matches.

To see how this works, we'll go back to the discussion from above where we talked about using INVALID markers to demarcate the different sections of the constructor bytecode. We'll call these a) constructor ops, b) runtime ops, and c) metadata. Remember how the Solidity constructor ops simply returned the runtime and metadata bytecode directly? That means that we can (for any contract that follows this pattern) simply check the contract's bytecode against the runtime / metadata byte code. We'll also need to be careful to avoid trying to match the bzzr hash, but otherwise, for any given Compound Protocol contract at a given release, we can construct a regular expression:

// metadataHex: hex of the CBOR metadata; bzzrHash: the 64-hex-digit Swarm hash inside it
new RegExp(`${runtimeBytecode}${metadataHex.replace(bzzrHash, "(?:[0-9a-f]{64})")}`)

That is, we can take the runtime bytecode and the metadata bytecode, swapping out the real bzzr hash for a pattern that matches any 64 hexadecimal characters. This regular expression will now match any Ethereum contract that was deployed from this contract's source. We can also track the constructor args and report out what the arguments were to that deployment. Finally, we can hook this up to a GitHub Action that, on each release, generates this regular expression and stores it in a public repo for others to use, which looks like this:

[
  {
    regex: /6080604052348015600f57600080fd5b506004361060285760003560e01c8063901717d114602d575b600080fd5b60336045565b60408051918252519081900360200190f35b60019056fea265627a7a72315820([0-9a-f]{64})64736f6c63430005100032/,
    contract: "Simple",
    repo: "compound.finance/compound-protocol",
    release: "v2.6",
    commit: "7516300"
  }
]

Thus, you can simply run each regex against a given code to find candidate matches for a given contract. Lastly, instead of trusting the candidates, you can use a Docker image to actually run the bytecode of the candidate, similar to Etherscan, and check to see if the contract does actually match.
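Here is a minimal Python sketch of that lookup, assuming a catalogue shaped like the entry above. `build_matcher`, `find_candidates`, and the in-memory `CATALOGUE` list are hypothetical names for illustration; the contract, repo, release, and commit values are copied from the example entry.

```python
import re

RUNTIME = (
    "6080604052348015600f57600080fd5b50"
    "6004361060285760003560e01c8063901717d114602d57"
    "5b600080fd5b6033604556"
    "5b60408051918252519081900360200190f3"
    "5b60019056fe"
)
BZZR = "8a8f3c65a125bc24e00953797e30744d7f79b3f85510aed0f969030bb4c88468"
METADATA = "a265627a7a72315820" + BZZR + "64736f6c63430005100032"


def build_matcher(runtime_hex: str, metadata_hex: str, bzzr_hex: str):
    """Regex for runtime + metadata with the Swarm hash turned into a wildcard."""
    head, tail = metadata_hex.split(bzzr_hex)
    return re.compile(
        re.escape(runtime_hex + head) + "([0-9a-f]{64})" + re.escape(tail)
    )


CATALOGUE = [
    {
        "regex": build_matcher(RUNTIME, METADATA, BZZR),
        "contract": "Simple",
        "repo": "compound.finance/compound-protocol",
        "release": "v2.6",
        "commit": "7516300",
    }
]


def find_candidates(deployed_hex: str):
    """Return (entry, swarm_hash) for every catalogue entry matching the code."""
    code = deployed_hex.lower()
    if code.startswith("0x"):
        code = code[2:]
    hits = []
    for entry in CATALOGUE:
        match = entry["regex"].fullmatch(code)
        if match:
            hits.append((entry, match.group(1)))
    return hits


# A deployment of the same source with a different Swarm hash still matches.
other = "496d028fc29efe2e00b5e738360c8efc4d04fecf7ec88f97503e0450e32461dd"
on_chain = "0x" + RUNTIME + METADATA.replace(BZZR, other)
for entry, swarm_hash in find_candidates(on_chain):
    print(entry["contract"], entry["release"], swarm_hash)
```

Using `fullmatch` rather than a substring search keeps a contract that merely embeds this bytecode from showing up as a candidate; whether that is the right trade-off depends on how strict you want the first pass to be.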

Conclusion

This is a long post that goes through Solidity's contract creation code and how it can be used to verify contracts on Etherscan or through static-code analysis. Hopefully this is helpful in better understanding the internals of your contracts and can save you a little time when you're frustrated that your code isn't verifying successfully on Etherscan.

@anggxyz

anggxyz commented Mar 14, 2024

incredible resource, thanks for sharing!
