KV maps in EOSIO discussion
From EOSIO Developers Chat: https://t.me/joinchat/0uhWYfXVpPlkNTA1
Jesse - CalEOS.io (caleosblocks), [27.09.21 23:27]
So, the KV_DATABASE feature uses disk, but when it's activated, the current multi-index tables are still in RAM?
Jesse - CalEOS.io (caleosblocks), [27.09.21 23:27]
I thought they went together... both to disk
Todd Fleming, [27.09.21 23:51]
What we created:
* kv intrinsics had an argument which would allow contracts to choose between DISK and RAM resource types
* node operators could choose whether DISK really went to disk or not, but the disk support was incomplete
* but, we didn't get around to adding the new DISK resource type to the system contract, so contracts really would only have RAM available short term
What the replacement team did:
* Removed the DISK/RAM option from the kv intrinsics. Noooooo!
* Added the ability to charge end users RAM for kv rows. Noooooo!
* Added an option to store everything on disk (dev preview)
Jesse - CalEOS.io (caleosblocks), [27.09.21 23:53]
Yikes... with my limited understanding of this, it feels like their understanding must've been more limited.
Ivan Kazmenko, [27.09.21 23:54]
For what it's worth, regarding documentation, I found get_kv_table_rows in the https://developers.eos.io API reference, but there's no pointer to what it actually means, should or would or will mean...
Todd Fleming, [27.09.21 23:58]
This is a conceptual overview, written before we nailed down the actual intrinsics: https://github.com/EOSIO/spec-repo/blob/master/esr_key_value_database.md
This is the intrinsic set we created: https://github.com/EOSIO/spec-repo/blob/master/esr_key_value_database_intrinsics.md
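For reference, the intrinsics in that second link had each call select the resource type through a leading db argument; a rough sketch of that shape (signatures approximated from the spec, which remains authoritative):

```cpp
// Rough shape of the original kv intrinsics (see the spec linked above for
// the authoritative signatures). The db argument selected the resource type:
// the spec used database names along the lines of "eosio.kvram" and
// "eosio.kvdisk", so each row chose RAM vs. DISK billing at write time.
extern "C" {
   void     kv_set(uint64_t db, uint64_t contract,
                   const char* key, uint32_t key_size,
                   const char* value, uint32_t value_size);
   bool     kv_get(uint64_t db, uint64_t contract,
                   const char* key, uint32_t key_size, uint32_t* value_size);
   uint32_t kv_get_data(uint64_t db, uint32_t offset,
                        char* data, uint32_t data_size);
   void     kv_erase(uint64_t db, uint64_t contract,
                     const char* key, uint32_t key_size);
}
// The replacement team's revision removed the db parameter, leaving a single
// RAM-billed store.
```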
Ivan Kazmenko, [28.09.21 00:08]
[In reply to Todd Fleming]
Thanks for the pointer.
Right now, I only understand the "limitations of existing tables" part well.
Will have to read the other part again :) .
João Ianuci, [28.09.21 03:54]
Can someone help me? I'm trying to send many actions to an external contract, one per element of a vector. How should I proceed?
cc32d9 | EOS Amsterdam, [28.09.21 07:17]
[In reply to Todd Fleming]
How is the disk storage implemented? Is it different from chainbase in mapped memory?
cc32d9 | EOS Amsterdam, [28.09.21 07:17]
*was
cc32d9 | EOS Amsterdam, [28.09.21 07:20]
[In reply to João Ianuci]
In eosjs, transact takes an array of actions, so you push multiple actions into it. But be aware that there's a time limit on the whole transaction; it needs to finish within 30 ms.
Martin, [28.09.21 09:41]
[In reply to Todd Fleming]
Maybe off-topic, but reading all those 'Noooooo!'s, do you feel comfortable with where the replacement team is taking things? I really wish we'd stuck with the original team...
Isaiah B, [28.09.21 09:49]
Anyone here use GitHub Pages and want to help me with it? I can't figure out why pushing an update leaves it on the standard Create React App page, but locally it's a different site when I run npm run build and start.
Todd Fleming, [28.09.21 13:45]
[In reply to cc32d9 | EOS Amsterdam]
Without the option: it was in RAM but charged to a different resource. With the option (we didn't quite complete this part before the team switch): RocksDB.
cc32d9 | EOS Amsterdam, [28.09.21 13:46]
Ok, and rocksdb itself is a bit of a disaster in terms of performance
Todd Fleming, [28.09.21 13:47]
I suspect we could have tuned it better if we had the time
Todd Fleming, [28.09.21 13:48]
I knew where the major bottleneck was
cc32d9 | EOS Amsterdam, [28.09.21 13:51]
Caching?
Todd Fleming, [28.09.21 13:55]
The way I implemented undo performed well enough on rodeos, but choked with nodeos's higher frequency of temporary undo sessions. The new team made an attempt at an alternative approach to undo, but that didn't end up performing well either. There's still more approaches to try.
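To make the session cost concrete, here is a minimal sketch of the undo pattern over a toy in-memory store (illustrative only, not the actual RocksDB implementation discussed here). Against a persistent backend, every pre-image capture turns into an extra read and write, which is one plausible reading of why nodeos's frequent short-lived sessions hurt more than the less frequent ones in rodeos:

```cpp
#include <map>
#include <optional>
#include <string>

// Illustrative only: a chainbase-style undo session over a toy key-value
// store. The first write to a key inside a session records a pre-image;
// undo() restores the pre-images, commit() discards them.
struct kv_store {
    std::map<std::string, std::string> data;

    struct session {
        explicit session(kv_store& s) : store(s) {}

        void set(const std::string& k, const std::string& v) {
            record(k);
            store.data[k] = v;
        }
        void erase(const std::string& k) {
            record(k);
            store.data.erase(k);
        }
        void undo() {  // roll every touched key back to its pre-image
            for (auto& [k, old] : pre) {
                if (old) store.data[k] = *old;
                else     store.data.erase(k);
            }
            pre.clear();
        }
        void commit() { pre.clear(); }  // keep the changes, drop pre-images

      private:
        void record(const std::string& k) {
            if (pre.count(k)) return;  // keep only the oldest pre-image
            auto it = store.data.find(k);
            pre.emplace(k, it == store.data.end()
                               ? std::nullopt
                               : std::optional<std::string>(it->second));
        }
        kv_store& store;
        // nullopt pre-image means the key did not exist before the session.
        std::map<std::string, std::optional<std::string>> pre;
    };

    session start() { return session(*this); }
};
```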
Todd Fleming, [28.09.21 13:58]
They also sacrificed a key trade-off: many things need to remain in RAM, and contract authors are the ones who need to make that trade-off.
cc32d9 | EOS Amsterdam, [28.09.21 14:07]
[In reply to Todd Fleming]
But undo information could be stored completely in RAM; there's no need to use the database for that
Todd Fleming, [28.09.21 14:09]
In nodeos, yes. Rodeos has a strong crash guarantee.
Todd Fleming, [28.09.21 14:10]
But there is a potential memory problem if there are 300 consecutive blocks of large data writes.
Todd Fleming, [28.09.21 15:16]
A big advantage to kv::map is its support for compound keys
Todd Fleming, [28.09.21 15:17]
A disadvantage is that it's currently only available on test nets
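For illustration, a compound key in kv::map can be an ordinary composite type such as a pair, so rows sort by the leading field first. A minimal sketch, assuming the developer-preview CDT API (the table and action names are invented, and the exact method set and key-type support should be checked against the preview headers):

```cpp
#include <eosio/eosio.hpp>
#include <utility>

// Sketch only: kv::map was a developer preview at the time of this chat
// (test nets only), so treat the exact API as approximate.
class [[eosio::contract]] kvdemo : public eosio::contract {
 public:
   using contract::contract;

   // Compound key: (owner, token). Rows are ordered by owner first, then
   // token, so a prefix scan over one owner groups all of that owner's rows.
   using balances_t =
      eosio::kv::map<"balances"_n, std::pair<eosio::name, eosio::name>, int64_t>;

   [[eosio::action]]
   void setbal(eosio::name owner, eosio::name token, int64_t amount) {
      balances_t balances;
      balances[{owner, token}] = amount;  // upsert through operator[]
   }
};
```

With multi-index tables, the same layout would have to be packed into a single fixed-width secondary key; here the compound key stays readable and still supports ordered range scans.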
Matt Witherspoon, [28.09.21 16:00]
[In reply to Todd Fleming]
I'm skeptical rocks performance can ever be comparable to chainbase. nodeos running chainbase can bury rodeos easily.
Todd Fleming, [28.09.21 16:01]
[In reply to Matt Witherspoon]
I agree. For nodeos, rocks should be limited to opt-in by contract.
Jesse - CalEOS.io (caleosblocks), [28.09.21 16:54]
Let me ask this… given where things stand today, where many chains are facing limits of scale as far as their servers' RAM is concerned, and it seems a couple of iterations of kv storage have been made… is there any chance we could see what you've described, Todd? Which I feel is the only good approach: the contract opts into a slower but more affordable/available storage (both in on-chain resources and in server hardware).
Jesse - CalEOS.io (caleosblocks), [28.09.21 16:55]
It sounds like it could still be possible, as no chains have enabled the current system on mainnet, so it's possible to remove the "everything in RAM, no DISK resource" implementation… which feels like the right thing to do.
Todd Fleming, [28.09.21 16:59]
Could a team come together and do this? Probably. Although there is some risk: the perf achieved so far isn't as good as I'd be comfortable with, even for DISK resource. I suspect we can make the perf better, but there is still a chance of failure.
Aarin Hagerty, [29.09.21 06:09]
Regarding the discussion earlier about the trade off with handling larger state size vs higher throughput in nodeos and the different implementation approaches available there (RocksDB, chainbase), I’ll throw another (admittedly half-baked) candidate into the mix to consider:
tl;dr: First-class support for vRAM in the EOSIO protocol
Imagine protocol-level changes to support partitioning state into persisted and ephemeral state domains, and allowing contracts to have a mix of them. Each state domain holds an ordered key value store and has a Merkle tree over those key value pairs (actually more complicated than the leaf nodes being just key value pairs, in order to properly support range scans, but let's not get into that now), where those roots are integrated into a larger Merkle tree whose root acts as the global state root of the blockchain state.

Persisted state domains require the key value pairs in that domain to be stored in the state tracked by full validation nodes. However, full validation nodes that don't serve other functions (like acting as a BP or API node) don't have to store the key value pairs within ephemeral state domains, other than the key value pairs accessed by contracts in the last N blocks (where N is the smallest value that is still large enough to include the reversible blocks and to go back to a block timestamp that is at least T minutes old); they also have to store the Merkle branches up to the global state root for those key value pairs.
Introduce the concept of an augmented block which includes the key value pairs accessed by smart contracts in that block (with deduplication and not including ones that would be generated by actions processed in that block). Full validation nodes are expected to receive augmented blocks to validate when syncing live. Nodes are expected to strip the augmented data from blocks (leaving only the core block in the block log) after a few months to save space. That means a full validation node may not have augmented blocks available in the network to sync from if it has gone a few months without syncing. It could always replay from the core block log in a mode where it keeps track of all key value pairs across all state domains; but realistically it wouldn’t be able to catch up anyway in that case so it would probably just use a snapshot.
Introduce the concept of an augmented transaction which wraps the core transaction (which is signed by the user) with other data that includes the key value pairs from ephemeral domains that the actions in the transaction access (as determined when speculatively executing the transaction) along with Merkle proofs of these claimed key value pairs up to a single global state root. Nodes upstream of the BP that do speculative execution (e.g. API nodes) are expected to run in a mode where they keep all key value pairs across all state domains available (there is a relaxation of this rule allowed as well which I will mention later). But they can keep the key value pairs of ephemeral domains in a slower data store on disk (e.g. RocksDB could be used here). These nodes could take a core transaction sent by the user, speculatively execute it, turn it into an augmented transaction, and then relay that augmented transaction to peers.
Aarin Hagerty, [29.09.21 06:09]
The BP nodeos (as well as other speculative nodeos instances upstream of the BP nodeos) would be responsible for checking that the global state root claimed in the augmented transaction (along with the associated claimed block height) is the one tracked by the node for that block height within its queue of the last N tracked block global state roots. It would also validate the Merkle proofs of the augmented transaction.

Then it can replace the key value pairs claimed in that augmented transaction with more recent ones overriding them that were created by transactions processed by the node since the block height claimed for the global state root in the augmented transaction (in reality the data structure for leaf nodes of the Merkle tree is more sophisticated, to allow support for range scans without this "overriding" process messing things up). Finally, it executes the transaction and preferentially pulls key value data within the ephemeral domains requested by the executing contracts from this set of (overridden) key value pairs. If a pair does not exist there (for example because the changing state caused the logic of the smart contract to look up a different key than in the speculative run), then the BP could either be configured to pull that key value pair from the disk-backed store containing all ephemeral key value pairs (assuming it was configured to track them like the speculative nodes) or simply reject the transaction. Or a more sophisticated approach could have it send the transaction back to its (horizontally scaled) helper nodes to speculatively execute it again and generate a new augmented transaction to try again.

The idea there would be to keep all state the BP needs to produce blocks within RAM and to not require it to look up state from disk during that critical single-threaded loop where it processes transactions. The validation of the Merkle trees per augmented transaction could be done in parallel in multiple threads.
This approach also allows pure validator nodes downstream of the BPs (ones that only need to process blocks and do not need to speculatively execute transactions) to track only the persisted state domain key value pairs (and the hope is that those are small enough to keep entirely in RAM), which reduces their state storage requirements considerably and hopefully allows them to be fast enough to keep up with the rest of the network even with much weaker computational resources than the API nodes.

Furthermore, one can imagine configuring the API nodes to only track in disk storage the ephemeral state domains for the applications they care about. That would mean they would not be able to speculatively evaluate and relay a received core transaction which accessed ephemeral state domains that the API node was not configured to track. But this could allow API providers collectively to logically shard API nodes to support a subset of contracts, such that the collection of all those API nodes could still hopefully cover all contract transaction execution use cases.
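The mechanical core of the augmented-transaction check described above is small: hash the claimed key value pair, fold in the supplied sibling hashes, and accept only if the result matches a recently tracked global state root. A sketch under loose assumptions (the hash, leaf encoding, and proof layout are placeholders, not anything specified for EOSIO, and real leaves would need the richer structure Aarin mentions for range scans):

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Illustrative only: checking that a key value pair claimed in an augmented
// transaction belongs to a recently tracked global state root.
struct digest {
    uint8_t bytes[32];
    bool operator==(const digest& o) const {
        return std::memcmp(bytes, o.bytes, 32) == 0;
    }
};

// Toy stand-in for a real cryptographic hash (FNV-1a stretched to 32 bytes);
// a real design would use something like SHA-256.
digest hash32(const void* data, size_t len) {
    uint64_t h = 1469598103934665603ull;
    auto p = static_cast<const uint8_t*>(data);
    for (size_t i = 0; i < len; ++i) { h ^= p[i]; h *= 1099511628211ull; }
    digest d{};
    for (int i = 0; i < 32; ++i) {
        d.bytes[i] = uint8_t(h >> ((i % 8) * 8));
        h = h * 6364136223846793005ull + 1442695040888963407ull;
    }
    return d;
}

struct proof_step {
    digest sibling;
    bool   sibling_on_left;  // which side the sibling sits on at this level
};

// Hash the claimed leaf, then fold in the sibling hashes up to a root.
digest apply_proof(const std::string& key, const std::string& value,
                   const std::vector<proof_step>& steps) {
    std::string leaf = key + '\0' + value;  // placeholder leaf encoding
    digest h = hash32(leaf.data(), leaf.size());
    for (const auto& s : steps) {
        uint8_t buf[64];
        std::memcpy(buf,      s.sibling_on_left ? s.sibling.bytes : h.bytes, 32);
        std::memcpy(buf + 32, s.sibling_on_left ? h.bytes : s.sibling.bytes, 32);
        h = hash32(buf, sizeof buf);
    }
    return h;
}

// The producer accepts the claimed pair only if the proof lands on one of
// the global state roots it tracked for the last N blocks.
bool accept_claim(const std::string& key, const std::string& value,
                  const std::vector<proof_step>& steps,
                  const std::vector<digest>& recent_roots) {
    digest root = apply_proof(key, value, steps);
    for (const auto& r : recent_roots)
        if (r == root) return true;
    return false;
}
```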