@onggunhao
Last active July 12, 2018 07:52
Overview of the Ethereum Dataset

Hi Sourav,

This is a knowledge transfer of the lessons I learnt working with Ethereum. I had to figure out a lot of this from scratch, so hopefully this saves you a lot of time.

The current geth node on andromeda is in a local environment so passwords etc will be included in this write-up (see the last page), but please change them as soon as possible for hygiene purposes.

Conceptual overview of dataset

The dataset on andromeda is a copy of Aashish's and Ivica's forked geth node that they used for the MAIAN project.

For this project, we will only be using Geth as an archival node (i.e. to query transaction information, or to get EVM execution traces using debug_traceTransaction). So there is theoretically no need to use the forked geth node.

While you can start working with the copy of Geth on andromeda immediately, I would recommend reprovisioning a Geth node and syncing a full copy of the Ethereum blockchain (see the current dataset's limitations below). I found myself spending more time fighting the problems with the current dataset (and the lack of sudo access on andromeda) than working on the actual task.

The only drawback is that a freshly provisioned Geth node may take up to a week to sync. You can also look into services like Quiknode (https://quiknode.io), but these cost money.

Limitations of our current dataset

The big thing is that the dataset is actually incomplete. Many of the Merkle tries from genesis block to the 4 millionth block are missing (or have been accidentally pruned). I spent a large amount of time trying to debug an error until I realized that the dataset is only complete from block 4 million onwards.

If I had to do it again, I would have just provisioned an AWS EC2 instance (t2.medium), attached a 3TB EBS volume, and re-synced the entire Ethereum chain on Geth. It would have allowed me to work remotely as well (do you have NUS VPN?). However, these things cost money so we're stuck with andromeda for now.

Conceptual overview of the task

Quite frankly, the coding task for this project was straightforward, but the difficulty was in learning the tooling in the Ethereum ecosystem, and where to look for the information needed. If you are already familiar with Ethereum, the EVM, and opcodes then feel free to skip this section.

At a high level, you will first need to enumerate the change each transaction makes to the blockchain. This can be divided into four categories:

1. Normal transactions:

These can be obtained from web3.eth.getBlock(...).transactions. You can look up the web3 API to see the fields on an individual transaction, or refer to my simple Python code.

for curBlockNum in range(startBlock, endBlock):

	# Gets current block, with full transaction objects rather than just hashes
	currentBlock = web3.eth.getBlock(curBlockNum, full_transactions=True)
	print("+++++++ Current block: " + str(curBlockNum) + " +++++++++++")

	# Iterates through current block's transactions
	for txn in currentBlock.transactions:

		# fromAddress details
		fromAddr = txn['from']
		fromAddrInitialBalance = getInitialBalance(fromAddr)

		# Shard
		shard = hash(fromAddr) % 50
		# shard = randint(0,49)

		# toAddress details (txn['to'] is None for contract-creation transactions)
		toAddr = txn['to']
		if toAddr:
			toAddrInitialBalance = getInitialBalance(toAddr)
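
One caveat with hash(fromAddr) % 50: Python 3 randomizes string hashes per process unless PYTHONHASHSEED is set, so the same address can land in a different shard on each run. Below is a minimal sketch of a reproducible alternative, assuming a SHA-256 digest of the lowercased address is an acceptable shard key. The helper name shard_for_address and the choice of hash are my assumptions, not part of the original code.

```python
import hashlib

NUM_SHARDS = 50  # same shard count as the loop above


def shard_for_address(address, num_shards=NUM_SHARDS):
    """Map an address string to a shard index that is stable across runs."""
    # Lowercase first so checksummed and plain forms of an address agree.
    digest = hashlib.sha256(address.lower().encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards
```

In the loop this would replace the hash(...) line with shard = shard_for_address(fromAddr).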

2. Internal Transactions (i.e. CALL opcode)

In Ethereum, internal transactions (when a transaction to a smart contract triggers a further transfer or call from the contract to another address) are not included in the web3.eth.getBlock(...).transactions data object.

To find these internal transactions, you will need to use Geth (not web3!) to run debug_traceTransaction. In short, this allows you to "replay" the execution of a contract, so you can see what the EVM is doing. You can read more about this method here: https://github.com/ethereum/go-ethereum/wiki/Management-APIs#debug_tracetransaction.

I could not find a python-geth library that worked well. Instead, we talk to Geth through its JSON-RPC interface.
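
As a rough illustration of what talking to Geth over JSON-RPC involves, here is a minimal sketch that uses only the standard library. The helper names build_rpc_payload and rpc_call are hypothetical, and the default URL assumes the node is listening on the --rpcport 9111 used in the "Run geth node" command at the end of this write-up.

```python
import json
import urllib.request


def build_rpc_payload(method, params, request_id=1):
    """Build a JSON-RPC 2.0 request body."""
    return {"jsonrpc": "2.0", "method": method, "params": params, "id": request_id}


def rpc_call(method, params, url="http://localhost:9111"):
    """POST a JSON-RPC request and return its 'result', raising on an 'error' reply."""
    body = json.dumps(build_rpc_payload(method, params)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        reply = json.loads(response.read().decode("utf-8"))
    if "error" in reply:
        # JSON-RPC failures carry an 'error' member instead of 'result'.
        raise RuntimeError(reply["error"])
    return reply["result"]
```

With this in place, a trace could be fetched with rpc_call("debug_traceTransaction", [txnHash])["structLogs"], where txnHash is the transaction hash as a hex string.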

To the best of my knowledge, internal transactions correspond to the CALL opcode. You should refer to the Ethereum yellowpaper (or the much simpler-to-understand beigepaper) to understand how CALL works.

debug_traceTransaction lets you see the values CALL takes from the stack:

  • stack[-1] is the gas amount
  • stack[-2] is the address the amount is sent to
  • stack[-3] is the amount being sent in wei

You should refer to the yellowpaper for more detail.

		# Gets EVM Trace from debug_traceTransaction
		params = [txnHash]
		payload = {
			"jsonrpc":"2.0",
			"method":"debug_traceTransaction",
			"params":params,
			"id":1
		}
		headers = {'Content-type':'application/json'}
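		debugTraceTransaction = session.post(
			'http://localhost:' + str(rpcport),
			json=payload,
			headers=headers
		)
		# A failed trace comes back with an 'error' key instead of 'result', so avoid a KeyError here
		transactionTrace = debugTraceTransaction.json().get('result', {}).get('structLogs')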

		# Handler for different EVM Opcodes
		if (transactionTrace):
			for log in transactionTrace:

				if(log['op'] == 'CALL'):
					txnGas = int(log['stack'][-1], 16)
					internalFromAddr = toAddr
					internalToAddr = '0x' + log['stack'][-2][24:64]	# Turn 64 char string into formatted address TODO: refactor into helper method
					internalTxnValue = int(log['stack'][-3], 16)

					internalFromAddrInitialBalance = getInitialBalance(internalFromAddr) + txnValue # Note: We add txnValue to cover instances where contract is a "pass through" contract
					internalToAddrInitialBalance = getInitialBalance(internalToAddr)

					# Sanity check for internal transactions
					if (internalFromAddrInitialBalance < internalTxnValue):
						debug_CALL_transactions = True
						debug_transaction = True

					if (debug_CALL_transactions):
						print("====== Hash: " + txnHash)
						print("TxnGas: " + str(txnGas))
						print("Internal fromAddr: " + internalFromAddr)
						print("Internal toAddr: " + internalToAddr)
						print("Internal txnValue: " + str(web3.fromWei(internalTxnValue, 'ether')))
						debug_CALL_transactions = False
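
The stack decoding above could be pulled into a small helper, along the lines of the TODO in the code. This is a minimal sketch: the name decode_call is hypothetical, and it assumes each stack entry is a hex string, either zero-padded to 64 characters (as the slicing above expects) or 0x-prefixed and unpadded, in case the node returns that form.

```python
ADDRESS_MASK = (1 << 160) - 1  # an address is the low 20 bytes of a 32-byte word


def decode_call(log):
    """Return (gas, to_address, value_wei) for a structLogs entry whose op is CALL."""
    stack = log["stack"]
    # int(..., 16) accepts both bare hex and "0x"-prefixed hex strings.
    gas = int(stack[-1], 16)
    to_address = "0x" + format(int(stack[-2], 16) & ADDRESS_MASK, "040x")
    value_wei = int(stack[-3], 16)
    return gas, to_address, value_wei
```

Inside the trace loop this would be used as txnGas, internalToAddr, internalTxnValue = decode_call(log) whenever log['op'] == 'CALL'.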

3. State Changes (SLOAD, SSTORE)

Similar to CALL, these opcodes take their arguments from the stack. SLOAD pops a storage key and pushes the value stored under it in the contract's storage, while SSTORE pops a key and a value and writes the value to storage, mutating the key-value pairs in the contract's storage trie (a Merkle Patricia trie). You will need to enumerate how these reads and writes touch the chain state.

The code that runs this is untested; one thing I wanted to put more time into was to understand how I could effectively test that my code was picking up the changes correctly.

(Perhaps - checking internal ERC20 token transfers? These are state changes?)
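
As a starting point for enumerating these accesses, here is a minimal sketch that scans the structLogs of a trace for SLOAD and SSTORE. The function name storage_accesses is hypothetical. It relies on the key being the top of the stack for both opcodes and the value being the next entry for SSTORE. Note that a trace entry does not name the contract whose storage is touched, so the caller would still need to track that from the surrounding CALL entries and the depth field.

```python
def storage_accesses(struct_logs):
    """Return (op, depth, key, value) for every SLOAD/SSTORE step in a trace.

    value is None for SLOAD, because the loaded word only appears on the
    stack of the following step.
    """
    accesses = []
    for log in struct_logs:
        op = log["op"]
        stack = log["stack"]
        if op == "SLOAD":
            accesses.append((op, log.get("depth"), int(stack[-1], 16), None))
        elif op == "SSTORE":
            key = int(stack[-1], 16)    # top of stack
            value = int(stack[-2], 16)  # second from top
            accesses.append((op, log.get("depth"), key, value))
    return accesses
```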

4. Other changes (Selfdestruct, etc)

I did not have enough time to look into this, but you could look into other EVM opcodes that can mutate state. One that I did not look at is SELFDESTRUCT, which would definitely require some sort of transaction ordering.

In particular, I would focus on the following opcodes:

  • DELEGATECALL
  • CALLCODE
  • SELFDESTRUCT or SUICIDE
  • CREATE (???)
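
To get a feel for how often these opcodes actually occur before investing in handlers for them, a trace can be scanned for them. This is a minimal sketch; the function and constant names are hypothetical, and the set simply mirrors the list above.

```python
from collections import Counter

# Opcodes from the list above; SUICIDE is the older name for SELFDESTRUCT.
OPCODES_OF_INTEREST = {"DELEGATECALL", "CALLCODE", "SELFDESTRUCT", "SUICIDE", "CREATE"}


def count_opcodes_of_interest(struct_logs):
    """Count how often each opcode of interest appears in one transaction trace."""
    return Counter(
        log["op"] for log in struct_logs if log["op"] in OPCODES_OF_INTEREST
    )
```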

Thinking about Transaction Dependence

I originally approached this problem incorrectly, from the perspective of "slotting" transactions into shards. This was conceptually wrong and quite frankly an uninteresting exercise.

Prateek's talk helped to clarify the intent of this exercise: it is basically to create a DAG of all transactions and their dependencies. In short:

Tx(1) -> Tx(3) -> Tx(2) -> Tx(4)

I was not very familiar with the Python ecosystem, but after some time I found the following two libraries that might help with the creation of a DAG-type data structure:

  • NetworkX: a Python network library that allows you to store data in nodes. However, given the size of our dataset, I am not sure whether you can hold it in RAM.
  • Neo4j: a graph database that lets you hold data in nodes and edges and persist it. It has some snazzy visualizations that could help you visualize the transaction dependence.
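
Whichever library ends up holding the graph, the edges have to be derived first. Here is a minimal sketch, assuming each transaction has already been reduced to a set of keys it reads and a set of keys it writes (account balances, storage slots), and assuming a later transaction depends on an earlier one when they conflict on a key (read-after-write, write-after-write, or write-after-read). The function name, the input shape, and this definition of dependence are my assumptions, not something fixed by the project.

```python
from collections import defaultdict


def build_dependencies(txns):
    """txns: list of (tx_id, read_keys, write_keys) in chain order.

    Returns {tx_id: set of earlier tx_ids it directly depends on}.
    """
    last_writer = {}                         # key -> most recent tx that wrote it
    readers_since_write = defaultdict(set)   # key -> txs that read it since then
    deps = {}
    for tx_id, reads, writes in txns:
        current = set()
        for key in reads:
            if key in last_writer:
                current.add(last_writer[key])          # read-after-write
        for key in writes:
            if key in last_writer:
                current.add(last_writer[key])          # write-after-write
            current.update(readers_since_write[key])   # write-after-read
        current.discard(tx_id)
        deps[tx_id] = current
        # Record this transaction's accesses for later transactions.
        for key in reads:
            readers_since_write[key].add(tx_id)
        for key in writes:
            last_writer[key] = tx_id
            readers_since_write[key] = set()
    return deps
```

Because edges only ever point from a later transaction to an earlier one, the result is acyclic by construction. Tracking only the last writer and the readers since that write keeps the work proportional to the number of accesses rather than comparing every pair of transactions.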

Common commands

User/Pass

user: daniel pass: triangleChicken1234# <- please change this ASAP once you are logged in.

You may also opt to create a new user on andromeda and cp the geth filestore over to your user's directory.

Log onto remote server

#!/usr/bin/expect -f
spawn ssh -X daniel@andromeda.d2.comp.nus.edu.sg
expect "assword:"
send "7L6vJ*LGMxf5"
interact

Run geth node

#!/bin/bash
/mnt/c/daniel/Test/go-ethereum/build/bin/geth --datadir /mnt/c/daniel/mychainfull/ --rpc --rpcapi "eth,net,web3,debug" --port 9001 --rpcport 9111 --mine --maxpeers 0 --etherbase 0 --ethash.dagdir ethash --latestblock 43340a6d232532c328211d8a8c0fa84af658dbff1f4906ab7a7d4e41f82fe3a3

Mount remote fs on localdir

This is useful if you want to use Sublime Text or Atom to edit the files. I had been using Vim on andromeda, and it was not the most productive experience.

#!/usr/bin/expect -f
spawn sshfs daniel@andromeda.d2.comp.nus.edu.sg:/home/daniel /home/daniel/andromedafs
expect "assword:"
send "7L6vJ*LGMxf5"
interact