@arcolife
Last active October 13, 2020 02:13
Casper Node LRT3 report - Oct 12, 2020

Bugs to create

  • memory bloating after stests workloads (gossiper etc.)

Bugs created

Bugs related to or opened while debugging with devs

----------------------------------------
                ip  port                                                hex             stake  stake_pct
0    172.85.63.221  7777  017448378f74b316c569f6466b6bf4863c929a30dd59e3...   100000000000000   0.333333
1     172.85.59.70  7777  0123806a5ea3b6ea8b460287a63f4f9cd369461abbdf02...   200000000000000   0.666667
2    172.85.77.211  7777  012249e7d56f03863fc6da3f2e2ebaa221fe876e97e1aa...   300000000000000   1.000000
3     172.85.76.71  7777  01564089b0327cb66c4a39a55c03621afb43598edd9601...   400000000000000   1.333333
4    172.85.83.255  7777  01cacdafc1e37050b4b9bce6834a7f34d4a0a7ec6e3718...   500000000000000   1.666667
5    172.85.92.130  7777  01a30191e0edfda795c46f3306fd1bd9eaf732c63aa763...   600000000000000   2.000000
6      172.51.55.0  7777  0126c4ca463bfffb975c16bb504dfb73c2ca3556087633...   700000000000000   2.333333
7    172.51.61.128  7777  01f61621b42ba1a6b7ad031092d89226074e14ed2f70be...   800000000000000   2.666667
8    172.51.94.194  7777  01a57c18f62d3b21b726b1b0fc5cbb4931fc0039e51c63...   900000000000000   3.000000
9     172.51.86.16  7777  012477fc1166fc2ab33fe442b83e702d329f8ece431d68...  1000000000000000   3.333333
10   172.51.66.159  7777  01993fd7ce7fbeb101fef00a676264f715a55cb8900285...  1100000000000000   3.666667
11   172.51.76.219  7777  01d43b5936675e20d8310ead32c231dd21f126eb2b257c...  1200000000000000   4.000000
12  172.142.66.161  7777  0110caecda77ed48374889fbab3b23051dbeed5e0154eb...  1300000000000000   4.333333
13   172.142.65.48  7777  01d9efcab2ea41b63baae674fd883b8fb49a74d424e1be...  1400000000000000   4.666667
14    172.142.50.8  7777  018ba1e48925a48743c7a7ba81f2b7b16eaa1f9878e9a4...  1500000000000000   5.000000
15  172.142.56.109  7777  0165f450c0096451b9b337aa01f13c9e8738d352ded13c...  1600000000000000   5.333333
16   172.142.82.82  7777  018b230de03f532e380c782815e23d8891def6ef838dcb...  1700000000000000   5.666667
17  172.142.93.130  7777  0103fa078970f8f2b73a3db2af246573221ad673b9f37b...  1800000000000000   6.000000
18   172.96.86.203  7777  01ad8cac60e805f6488b0b270540bb23d61cbbe4eff076...  1900000000000000   6.333333
19    172.96.82.84  7777  018c3c5185fcd0838df84e7b4ea2091cb76b1ad74e7ecd...  2000000000000000   6.666667
20    172.96.67.60  7777  0122322b8303abc6b572958b4e09a664e10a5f8572cdcc...  2100000000000000   7.000000
21   172.96.73.159  7777  01f2124a331a91d60ea03b873cb0fee34185ac74825461...  2200000000000000   7.333333
22    172.96.52.96  7777  01da733017a213fa6fb25b065201b8c79e67b8b35947bb...  2300000000000000   7.666667
23   172.96.52.130  7777  016b2fdcaf5701a054ee0720763fc548a309195be796ea...  2400000000000000   8.000000
----------------------------------------
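For reference, the stake_pct column appears to be each validator's stake as a share of the total genesis stake. A quick sketch of the derivation (my own illustration; the linear stake schedule is read off the table above):

```python
# Reconstruct the stake_pct column: 24 validators with linearly increasing
# stakes of n * 100_000_000_000_000, n = 1..24 (as in the table above).
stakes = [n * 100_000_000_000_000 for n in range(1, 25)]
total = sum(stakes)
pcts = [s / total * 100 for s in stakes]
print(round(pcts[0], 6))   # 0.333333 (node 0)
print(round(pcts[-1], 6))  # 8.0 (node 23)
```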

Setup

  • Version - Oct 2 git commit on master - f1f284910244fb20a4a83488634177fd1e586228
  • chainspec defaults:
era_duration = '30seconds'
minimum_era_height = 10
booking_duration = '10days'
entropy_duration = '3hours'
voting_period_duration = '2days'
finality_threshold_percent = 12
minimum_round_exponent = 12

Run

  • Stopped 3 nodes for restart tests (indexed 0, 1, and 23 in _nodes_stake_distribution.md above):
172.85.59.70 - pre-existing state, with TRACE logging
172.85.63.221 - pre-existing state, with TRACE logging
172.96.52.130 - clear state, with TRACE logging

Conclusion

  • Fails to sync:
    1. Post restart (doesn't pick up syncing again).
    2. Late bloomer - attempts to catch up after a couple of eras; synced partially, but very slowly.

Setup

  • Version - Oct 8 casper-node v0.1.0_3-438-geccab1cf-eccab1cf
  • chainspec defaults:
era_duration = '1500seconds'
minimum_era_height = 100
booking_duration = '10days'
entropy_duration = '3hours'
voting_period_duration = '2days'
finality_threshold_percent = 12
minimum_round_exponent = 14

Run

Deploy tests

WASM-less transfers

$ stests-cache-view-run --run 1 --type wg-100
 Phase / Step     Execution Start Time  Execution End Time  Duration (s)  Deploys             Action    Status 
 R-001            -                     -                         27.073       50                 --  COMPLETE 
 R-001.P-01       -                     -                         26.857       50                 --  COMPLETE 
 R-001.P-01.S-01  -                     -                          0.063        0    create-accounts  COMPLETE 
 R-001.P-01.S-02  -                     -                         26.653       50  fund-run-accounts  COMPLETE 
LRT-03 - WG-100  - Run 1

Full WASM transfers

$ stests-cache-view-run --run 1 --type wg-110
 Phase / Step     Execution Start Time  Execution End Time  Duration (s)  Deploys             Action       Status 
 R-001            -                     --                     14031.505       49                 --  IN_PROGRESS 
 R-001.P-01       -                     --                     14031.474       49                 --  IN_PROGRESS 
 R-001.P-01.S-01  -                     -                          0.053        0    create-accounts     COMPLETE 
 R-001.P-01.S-02  -                     --                     14031.354       49  fund-run-accounts  IN_PROGRESS 
LRT-03 - WG-110  - Run 1
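The throughput contrast is stark: WASM-less funding moved 50 deploys in ~27 s, while the full-WASM run was still in progress after ~14,000 s for 49 deploys. A back-of-envelope comparison from the two step durations above (approximate, since the wg-110 run was still IN_PROGRESS when sampled):

```python
# Deploy throughput from the fund-run-accounts step durations above.
wasmless  = 50 / 26.653      # wg-100: ~1.88 deploys/s
full_wasm = 49 / 14031.354   # wg-110: ~0.0035 deploys/s (still running)
print(round(wasmless, 2), round(full_wasm, 4), round(wasmless / full_wasm))
```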

Findings

  1. Repetitive deploy log messages for a single node.

Screenshot from 2020-10-11 08-41-22

--

  2. After the full-WASM transfers, the majority of the network panicked and only 5 nodes remained up.

Apparently the fix that went in 15 days ago (casper-network/casper-node#282) may need a follow-up.

Screenshot from 2020-10-11 08-44-26

ec2-uw2-1a-LRT3-benchN1 - 172.142.66.161 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-uw2-1a-LRT3-benchN4 - 172.142.65.48 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-uw2-1a-LRT3-benchN0 - 172.142.50.8 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-uw2-1a-LRT3-benchN3 - 172.142.56.109 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-uw2-1a-LRT3-benchN5 - 172.142.82.82 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-ew2-1a-LRT3-benchN2 - 172.96.86.203 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-ec1-1a-LRT3-benchN3 - 172.85.59.70 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-ec1-1a-LRT3-benchN1 - 172.85.77.211 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-ec1-1a-LRT3-benchN4 - 172.85.76.71 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-ase2-1a-LRT3-benchN3 - 172.51.55.0 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-ase2-1a-LRT3-benchN0 - 172.51.61.128 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-ase2-1a-LRT3-benchN5 - 172.51.94.194 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-ase2-1a-LRT3-benchN2 - 172.51.86.16 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-ase2-1a-LRT3-benchN4 - 172.51.66.159 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-ase2-1a-LRT3-benchN1 - 172.51.76.219 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-ec1-1a-LRT3-benchN5 - 172.85.83.255 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
ec2-uw2-1a-LRT3-benchN2 - 172.142.93.130 - panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43

ec2-ew2-1a-LRT3-benchN5 - 172.96.82.84 - running
ec2-ew2-1a-LRT3-benchN4 - 172.96.67.60 - running
ec2-ew2-1a-LRT3-benchN1 - 172.96.73.159 - running
ec2-ew2-1a-LRT3-benchN0 - 172.96.52.96 - running
ec2-ec1-1a-LRT3-benchN2 - 172.85.92.130 - running

ec2-ew2-1a-LRT3-benchN3 - 172.96.52.130 - stopped / restart sync very slow - REST metrics started before sync completion

ec2-ec1-1a-LRT3-benchN0 - 172.85.63.221 - stopped / restart sync tests failed
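The `should create ro txn: ReadersFull` panics come from LMDB, which keeps a fixed-size table of reader slots (126 by default) and refuses to open another read-only transaction once every slot is taken - typically because read transactions are leaked or held open under load. A toy model of that failure mode (my own illustration, using a semaphore as a stand-in for LMDB's reader table; not the actual storage code):

```python
import threading

class ReadersFull(Exception):
    """Raised when no reader slot is free, mirroring LMDB's MDB_READERS_FULL."""

class ReaderTable:
    """Toy stand-in for LMDB's fixed-size reader slot table."""
    def __init__(self, max_readers: int):
        self._slots = threading.Semaphore(max_readers)

    def begin_ro_txn(self):
        # Non-blocking acquire: if the table is full, fail like LMDB does.
        if not self._slots.acquire(blocking=False):
            raise ReadersFull("should create ro txn: ReadersFull")
        return self._slots  # caller must release() to end the txn

# If read transactions are opened but never committed/aborted (leaked),
# the table fills up and every further reader fails:
table = ReaderTable(max_readers=3)
leaked = [table.begin_ro_txn() for _ in range(3)]
try:
    table.begin_ro_txn()
except ReadersFull as e:
    print("panic:", e)
```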
  3. Even though the live stake is only about 30% while the finality threshold (FTT) was 12% (i.e., only 5 of the 24 nodes were up), the network is somehow still live:
               port                                                hex             stake                                           lfb_hash  era_id   height                    tstamp  stake_pct
172.85.92.130  7777  01a30191e0edfda795c46f3306fd1bd9eaf732c63aa763...   600000000000000  4f47e205e8da6b0a5bacdb31c36e7377514cf18d05bb00...   117.0  11891.0  2020-10-11T07:09:34.080Z   2.000000
172.96.82.84   7777  018c3c5185fcd0838df84e7b4ea2091cb76b1ad74e7ecd...  2000000000000000  4f47e205e8da6b0a5bacdb31c36e7377514cf18d05bb00...   117.0  11891.0  2020-10-11T07:09:34.080Z   6.666667
172.96.67.60   7777  0122322b8303abc6b572958b4e09a664e10a5f8572cdcc...  2100000000000000  4f47e205e8da6b0a5bacdb31c36e7377514cf18d05bb00...   117.0  11891.0  2020-10-11T07:09:34.080Z   7.000000
172.96.73.159  7777  01f2124a331a91d60ea03b873cb0fee34185ac74825461...  2200000000000000  4f47e205e8da6b0a5bacdb31c36e7377514cf18d05bb00...   117.0  11891.0  2020-10-11T07:09:34.080Z   7.333333
172.96.52.96   7777  01da733017a213fa6fb25b065201b8c79e67b8b35947bb...  2300000000000000  4f47e205e8da6b0a5bacdb31c36e7377514cf18d05bb00...   117.0  11891.0  2020-10-11T07:09:34.080Z   7.666667
----------------------------------------
Current live stake in network: 30.67%

Screenshot from 2020-10-11 10-55-05
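For reference, the reported live stake is just the sum of stake_pct over the five nodes still running (values from the status table above):

```python
# Live stake = sum of stake_pct over the five surviving validators.
running = [2.000000, 6.666667, 7.000000, 7.333333, 7.666667]
print(round(sum(running), 2))  # 30.67
```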

Setup

  • Version - Oct 10 casper-node v0.1.0_3-484-ga7d0f324-a7d0f324
  • chainspec defaults:
era_duration = '1500seconds'
minimum_era_height = 100
booking_duration = '10days'
entropy_duration = '3hours'
voting_period_duration = '2days'
finality_threshold_percent = 12
minimum_round_exponent = 14

Run

Sync tests

Had 2 stopped nodes for this test. It failed again, with results similar to Run 1. This time, though:

  • one of the nodes repeatedly got a 6/ABRT signal and died.
Oct 12 12:27:21 ec2-ew2-1a-LRT3-benchN3 systemd[1]: casper-node.service: Main process exited, code=killed, status=6/ABRT
Oct 12 12:27:21 ec2-ew2-1a-LRT3-benchN3 systemd[1]: casper-node.service: Failed with result 'signal'.
  • The other node didn't complete sync, but started its HTTP server anyway.

Deploy tests

Network = lrt3.
 Generator  ID     Execution Start Time     Execution End Time       Duration (s)  Deploys    Status                  Step 
 WG-100     R-001  2020-10-12T13:45:28.950  2020-10-12T13:45:57.043        28.094       50  COMPLETE                    -- 
 WG-110     R-001  2020-10-12T13:59:37.909  2020-10-12T14:00:04.970        27.061       50  COMPLETE                    -- 
 WG-110     R-002  2020-10-12T14:07:08.123  2020-10-12T17:46:04.775     13136.652       50  COMPLETE                    -- 
  1. wg-100 - hit the following ERROR:

{"timestamp":"Oct 12 13:46:23.918","level":"ERROR","target":"casper_node::components::deploy_buffer","fields":{"message":"finalized block that hasn't been proposed!"}}
  2. wg-110 - as soon as I launched this afterwards, 9 nodes panicked with the same error as in Run 2:
panicked at 'should create ro txn: ReadersFull', node/src/components/storage/lmdb_store.rs:60:43
Current live stake in network: 60.67%
                stake_pct                                           lfb_hash  era_id  height                    tstamp
172.85.77.211    1.000000  fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d...    10.0  1054.0  2020-10-12T14:04:30.592Z
172.85.76.71     1.333333  fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d...    10.0  1054.0  2020-10-12T14:04:30.592Z
172.85.92.130    2.000000  fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d...    10.0  1054.0  2020-10-12T14:04:30.592Z
172.51.61.128    2.666667  fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d...    10.0  1054.0  2020-10-12T14:04:30.592Z
172.51.86.16     3.333333  fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d...    10.0  1054.0  2020-10-12T14:04:30.592Z
172.51.66.159    3.666667  fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d...    10.0  1054.0  2020-10-12T14:04:30.592Z
172.142.66.161   4.333333  fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d...    10.0  1054.0  2020-10-12T14:04:30.592Z
172.142.65.48    4.666667  fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d...    10.0  1054.0  2020-10-12T14:04:30.592Z
172.142.50.8     5.000000  fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d...    10.0  1054.0  2020-10-12T14:04:30.592Z
172.142.56.109   5.333333  fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d...    10.0  1054.0  2020-10-12T14:04:30.592Z
172.142.93.130   6.000000  fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d...    10.0  1054.0  2020-10-12T14:04:30.592Z
172.96.86.203    6.333333  fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d...    10.0  1054.0  2020-10-12T14:04:30.592Z
172.96.73.159    7.333333  fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d...    10.0  1054.0  2020-10-12T14:04:30.592Z
172.96.52.96     7.666667  fe672a4ff93b26cc9788d53ba0a446d1c4fa4fcb293a6d...    10.0  1054.0  2020-10-12T14:04:30.592Z
----------------------------------------
  3. Run 3 hammered the network completely; the network now looks like this:
----------------------------------------

Current live stake in network: 27.33%
                stake_pct                                           lfb_hash  era_id  height                    tstamp
172.85.77.211    1.000000  0ff99ab7f512681b0d1512b0e2eddc5730c7eef48d16d7...    10.0  1059.0  2020-10-12T14:06:25.280Z
172.85.92.130    2.000000  0ff99ab7f512681b0d1512b0e2eddc5730c7eef48d16d7...    10.0  1059.0  2020-10-12T14:06:25.280Z
172.142.65.48    4.666667  0ff99ab7f512681b0d1512b0e2eddc5730c7eef48d16d7...    10.0  1059.0  2020-10-12T14:06:25.280Z
172.142.93.130   6.000000  0ff99ab7f512681b0d1512b0e2eddc5730c7eef48d16d7...    10.0  1059.0  2020-10-12T14:06:25.280Z
172.96.86.203    6.333333  0ff99ab7f512681b0d1512b0e2eddc5730c7eef48d16d7...    10.0  1059.0  2020-10-12T14:06:25.280Z
172.96.73.159    7.333333  0ff99ab7f512681b0d1512b0e2eddc5730c7eef48d16d7...    10.0  1059.0  2020-10-12T14:06:25.280Z
----------------------------------------

The network is stalled and not moving at this point (in contrast to Run 2, where the LFB kept moving at ~30% live stake after the wg-110 workloads).

Network partitioning

Not possible at the moment; see https://casperlabs.atlassian.net/browse/FEAT-227 (ref: https://casperlabs-team.slack.com/archives/CENLF23FG/p1602141096066000).
