karalabe/txpool.md Secret

## txpool.md

      
    Raw
  

              txpool.md
            
          
    Persistent txpool proposal

Geth's transaction pool is currently in-memory only. Whenever Geth is started up, it starts making network connections to remote peers and builds up a view of the transactions circulating in the network. Local transactions are presisted across restarts, but that's just a quirk, it's not relevant in this discussion.
A purely in-memory pool served us well, but we're hitting the limits on what can be achieved with it. Memory is a scarce resource and the Ethereum network - globally - shuffles more transactions than can fit into the RAM of any meaningfully sized node. Currently Geth nodes by default are limited to maintianing the best (i.e. most expensive) 4096 transactions, but according to txstreet.com, there are 1-2 orders of magnitude more - executable, alas cheap - transactions published in the network.
At first it's not obvious why this is a problem. If Geth maintains 4K best transactions, and a mainnet block fits on average 200-300, everything should be golden. Every new block frees up some space in the pool, that can be taken up by the next best transactions. The catch is not in the pool alone, rather also ow the transactions propagate through the network. Geth nodes deduplicate announcements across network connection, so once a tx crosses a link, it will never be retransmitted. If a transaction is accepted into the pools, but later discarded (outpriced), Geth nodes will actively prevent it from rippling through the network again. This issue gets worse as the network throughput is increased.
The obvious - and flawed - proposal is to introduce some retransmission mechanism. The reason this is an non-viable proposal is because transactions propagations - even the first time - are essentially amplified DoS attacks on the network: a user sends in a small piece of data that gets cascaded across the entire Ethereum network. Propagating it once is acceptable as there is a monetary cost to the sender that prevents blatant sustained SPAM. A retransmission mechanism, however, could be abused to resend the same data over and over an over again, without any additional cost to the original sender. Irrelevant what thorttling algorithm we dream up, as long as we're not tracking every past transaction ever, a malicious attacker can always create a pattern of retransmissions that will keep rotating the same data throughout the network. This happened during the 2015 winter holidays, almost taking down the network.
A second - unrelated - problem with Geth's current transaction pool is that it is overly expensive to have meaningful control over the amount of memory used in total. Transaction byte-sizes vary a lot: small Ether transfers are 200 bytes whereas STARK proofs can go up to 128KB (our current hard limit). There's no inherent difficulty in tracking how much memory the pool uses, the issue arises when a large and expensive transaction attempts to displace tiny cheap ones. An operation that should be O(1) in complexity could turn into O(n), opening us up to DoS attacks. Similarly, when a large transaction would get replaced with a tiny one, the pool would be opened up to an influx of new transactions, leading to a cascade of network packets throughout the entire global P2P network.
The above problem leads to the rationale that Geth must not care about the size of a transaction when deciding to add it to, or drop it from the pool. This becomes problematic when memory is limited but users want large transaction. If Geth allows 256KB transactions and tracks 4K of them, that's 1GB+ or RAM usage. Realistically it never will be that much, but an attacker could push it that high. This prevents us from raising the 4K transaction limit, and also from raising the transaction size limit. Geth currently somewhat hacks around these limitations by neither limiting the pool by transaction count, nor by transaction size; rather by slot count (which is ceil(size / 32KB)). This lands us in a middle ground where we can accept slighly larger transaction at an amortized O(1) operation complexity.
Summarizing the issues above, we need to make Geth's pool more reliable against bursts of transactions / price fluctuations and large transactions. These are at odds with each other as the former requires tracking more transactions whilst the latter tracking fewer. A retransmission mechanism could solve both problems, but we cannot do that safely. As we're out of options, we must break some system invariant that keeps us gridlocked. The only way to store more and larger transactions without ever requiring retransmissions is to introduce a persistent backend to the transaction pool.
Splitting responsibilities

The current architecture of the txpool does not lend itself to easy persistency. We are maintaining multiple nonce-sorted transaction lists per every active account (executable and gapped txs), and also global price-sorted indices (heaps) across all transactions for evictions. Every time a transaction is inserted or dropped from the pool, there's a heavy internal shuffling going on which would be too expensive to either persist to disk or to even require disk reads.
The solution lies in the observation that the txpool is actually serving two purposes combined:

It is storing transactions that can be retrieved by the miner or by remote peers
It is ordering transactions according to most value given to the network (price, nonce)

The above two purposes however have different data availability requirements. Storing the transactions for the miner or rmeote peers obviously requires having the entire transaction stashed away somewhere, but there is no need to instantaneus access to them. A remote peer can wait a bit for us to look it up, and for the miner we can keep the handful of best transactions in memory.
Shuffling the transactions according to different criteria; enforcing pool caps, tracking local transactions, etc. all require blazing fast access to quickly decide which transaction to keep and which to discard. However, these decisions don't need access to the entire transaction blob, only some metadata fields: sender, nonce and gas price.
We can thus refactor the transaction pool to take the above observations into consideration. Instead of shuffling everything in memory and wasting RAM for storing uninteresting data blobs; and instead of shuffling data on disk and killing performance due to constants updates; we can use a persistent database to store the entirety of transactions and only keep a tiny subset of the fields loaded in memory.
Adding up the fields needed for pool scheduling (sender, nonce, gas price) and adding a unique identifier to allow looking up the original transaction (hash), we arrive to a total of less than 100 bytes / tx needed to be in active memory, plus whatever boilerplate the pool adds. Let's say roughly 200B. Keeping the current target of not using more than 128MB RAM for the txpool (4K slots, 32KB each), the same memory would be enough to track 128MB / 200B = 640K transacions, a 160x increase compared to the current capacity.
Just for the sake of completelenss, storing the full data blobs for these 640K stub transactions in the database would be equivalent to 640K * 128KB (max allowed tx size currently) = 82GB, which isn't even that insane of a number for disk storage. Of course, it makes absolutely no sense to track 640K transactions across a network than can include 200tx/block as we would need 640K / 200tx/s * 15s/block = 13.3h to mine all of them with 0 new transactions arriving in the mean time.
A bump from the current 4K to say 16K would alreaady be an immense improvement in the quality of the network, the cost of which would be 3.2MB RAM and 2GB disk space if we assume the worst case scenario of 128KB spam txs. Realistically we could bump the tx size limit up to even 512KB without causing any noticeable disk usage even during the worst attacks.
Devil's in the details

Whilst increasing the transaction pool's capacity with a persistent backend is completely reasonable and workable from the pool's perspective, there are other subsystems that will be affected too and will need some tweaking.
As mentioned above already, it is fine to have rmeote peers wait a bit if they want to retrieve old - but not yet included - transactions, however we don't want the miner to have to wait for disk reads. This can be solved by always keeping a subset (2-3 block's worth) of the best transactions cached in memory.
Currently Geth nodes upon connecting to each other, exchange their entire transaction pool's content hashes so they may retrieve anything unknown. For a pool cap of 4K transactions, the hash set is 128KB in total. If we start to push the limit upwards, this logic will need to be rethought/polished. The suggested 16K tx bump is still within reasonable limits (512KB / connection) so this proposal can be immediately useful without solving this problem. But increasing the pool capacity beyond 16K txs needs this piece solved too. One idea is that if we assume nodes persist their transactions, we could also persist a timestamp of when we've received a transaction. Then nodes could maybe query txs like "I was last online at XYZ, gimme tx hashes newer than that". At the consensus level there's no notion of transaction timestamps of course, and the values would be a bit fuzzy across peers, but as long as we assume peers keep enough txs persisted, I think it could work.
Similar to announcements, we might need to make retrievals a bit smarter too as to prevent nodes from generating an insane amount of traffic while they sync up each others' pools. We could use the same trick that Felix did with the old pool propagation whereby we used a single thread across all peers to do the initial transaction sync. Propagations would still be concurrent, so once you joined the network, you're fully active, but backfilling past transactions is ok to be done "eventually".
One random idea I had is that currently we track the executable and non-executable transactions via heaps in the pool. This in general gives us a O(logN) maintenance time. If we were to bump the number of transactions, we could try replacing the heaps with a doubly-linked list + map indexing individual items. That would result in O(1) complexity, though the global price index might still limit us the same way, so unsure if it's worth the hassle.
Open questions

The most important open question that needs answering is how we can handle the txpool battles (front running, back running, side running, upside down running, etc). Given that we cannot prevent people from battling each other, we need to ensure that it doesn't become a DoS vector. If the persistency proposal writes every transaction to disk, we cannot keep replacing the same nonce hundreds of times per second (or per block). Given that we can't place any per account limits either (users will just use multiple accounts), we need a way to short circuit these micro-lifetime transactions to never hit disk. One idea here is to have a slight delay in the pool, so that we keep recently received transactions in memory and only persist to disk once they are older than say 1-2 blocks. That would avoid both battling txs from hitting disk as well as insta-mined transaction (huge price points).
Implementaion path

TODO once we can't poke holes on the idea