@slfritchie
Created September 12, 2013 00:43
Greg's plan for this afternoon:
handoff 1
A+??
riak_core_node_watcher:service_down(riak_kv).
/usr/lib64/riak/erts-5.8.4/bin/escript msgq.escript /etc/riak/vm.args
riak_core_node_watcher:service_up(riak_kv, whereis(riak_kv_sup)).
For setting 'handoff_concurrency' value:
I = fun(Num) -> rpc:multicall(lists:sublist([node()|nodes()], Num), application, set_env, [riak_core, handoff_concurrency, 2]), rpc:multicall(lists:sublist([node()|nodes()], Num), riak_core_vnode_manager, force_handoffs, []) end.
% AllFallbacks = fun(Ring) ->
% [{N,Fallbacks(N, Ring)} || N <- riak_core_ring:all_members(Ring), lists:member(N, [node()|nodes()])]
% end.
%
% rpc:multicall([element(1, N) || N <- lists:sublist(AllFallbacks(element(2, riak_core_ring_manager:get_raw_ring())), 3)], application, set_env, [riak_core, handoff_concurrency, 1]).
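For example, to apply the I fun above to the first three nodes in [node()|nodes()]
(it sets handoff_concurrency to 2 on those nodes and then forces handoffs there):
I(3).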
ssh -p 2222 -l basho prod-04.XXXer.com
[12/7/11 2:36:21 PM] Greg Burd: msgq.escript
[12/7/11 2:36:37 PM] Greg Burd: /usr/lib64/riak/erts-5.8.4/bin/escript msgq.escript /etc/riak/vm.args
[12/7/11 2:37:17 PM] Greg Burd: basho@prod-08:~$ sudo cluster "grep ERL_MAX_PORTS /etc/riak/vm.args"
[12/7/11 2:37:31 PM] Greg Burd: [sudo] password for basho:
Wed Dec  7 12:37:10 PST 2011
+++ Processing: ........................................................................................................
#--- prod-04:
-env ERL_MAX_PORTS 32768
#--- prod-05:
-env ERL_MAX_PORTS 32768
[...]
------------------
hd(lists:reverse(lists:sort([{ process_info(Pid,message_queue_len), Pid} || Pid <- processes()]))).
OwnedPartitions = fun(Owners, Node) ->
[P || {P, Owner} <- Owners, Owner =:= Node]
end.
VnodePids = fun(Node) -> [Pid || {_,Pid,_,_} <- supervisor:which_children({riak_core_vnode_sup, Node}), is_pid(Pid)] end.
RunningVnodes = fun(Node) ->
Pids = VnodePids(Node),
[rpc:call(Node, riak_core_vnode, get_mod_index, [Pid], 30000) || Pid <- Pids]
end.
ActivePartitions = fun(Node) ->
lists:foldl(fun({_,P}, Ps) ->
ordsets:add_element(P, Ps)
end, [], RunningVnodes(Node))
end.
Fallbacks = fun(Node, Ring) ->
Owners = riak_core_ring:all_owners(Ring),
Owned = ordsets:from_list(OwnedPartitions(Owners, Node)),
Active = ordsets:from_list(ActivePartitions(Node)),
Stopped = ordsets:subtract(Owned, Active),
ordsets:subtract(Active, Owned)
end.
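A usage sketch for the funs above (hedged: riak_core_ring_manager:get_my_ring/0 is
the usual way to fetch the local ring; the raw-ring call in the commented-out
AllFallbacks snippet works too):
{ok, Ring} = riak_core_ring_manager:get_my_ring().
Fallbacks(node(), Ring).   % partitions this node is running but does not own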
--------------------------
[12/7/11 2:42:40 PM] Scott Fritchie: +A 257 sounds like a fun number to experiment with.
[12/7/11 2:43:09 PM] Greg Burd:              {wants_claim_fun, {riak_core_new_claim, new_wants_claim}},
             {choose_claim_fun, {riak_core_new_claim, new_choose_claim}},
--------------------------
http://dl.dropbox.com/u/11999159/Basho/2011-12-06-log1.txt
rp(lists:sort([{Mem, Pid, Queue, Dict}
|| Pid <- [Pid || {_,Pid,_,_} <- supervisor:which_children(riak_core_vnode_sup)],
{memory, Mem} <- [catch process_info(Pid, memory)],
{message_queue_len, Queue} <- [catch process_info(Pid, message_queue_len)],
{dictionary, Dict} <- [catch process_info(Pid, dictionary)],
is_integer(Mem)])).
lists:reverse(lists:sort([{Queue, Pid} || Pid <- [Pid || {_,Pid,_,_} <- supervisor:which_children(riak_core_vnode_sup)], {memory, Mem} <- [catch process_info(Pid, memory)], {message_queue_len, Queue} <- [catch process_info(Pid, message_queue_len)], {dictionary, Dict} <- [catch process_info(Pid, dictionary)], is_integer(Mem)])).
--------------------------
basho@prod-04:~$ awk '/# Server: prod-[0123456789]/ { last = 2 } /^10/ && last == 0 { print } { last--}' /etc/hosts
10.28.60.208 prod-04-internal.XXXer.com raid-0-b
10.28.60.210 prod-05-internal.XXXer.com raid-0-b
10.28.60.212 prod-06-internal.XXXer.com RAID0-B
10.28.60.214 prod-07-internal.XXXer.com RAID0-B
10.28.60.216 prod-08-internal.XXXer.com RAID0-B
10.28.60.218 prod-09-internal.XXXer.com RAID0-B
10.28.60.220 prod-10-internal.XXXer.com Raid0-B
10.28.60.222 prod-11-internal.XXXer.com Raid0-B
10.28.60.224 prod-12-internal.XXXer.com RAID0-B
10.28.60.226 prod-13-internal.XXXer.com RAID5-B
10.28.60.228 prod-14-internal.XXXer.com RAID0-B
10.28.60.230 prod-15-internal.XXXer.com RAID0-B
10.28.60.202 prod-16-internal.XXXer.com --Not-In-Service--
10.28.60.232 prod-17-internal.XXXer.com raid-5-b
10.28.60.234 prod-18-internal.XXXer.com raid-5-b
10.28.60.238 prod-19-internal.XXXer.com RAID0-B
10.28.60.240 prod-20-internal.XXXer.com RAID0-B
10.28.60.242 prod-21-internal.XXXer.com RAID0-B
10.28.60.244 prod-22-internal.XXXer.com raid0-b
10.28.60.246 prod-23-internal.XXXer.com RAID0-B
10.28.60.248 prod-24-internal.XXXer.com RAID0-B
10.28.60.250 prod-25-internal.XXXer.com raid0-b
10.28.60.254 prod-26-internal.XXXer.com RAID0-B
10.28.60.252 prod-27-internal.XXXer.com RAID0-A
10.28.233.195 prod-28-internal.XXXer.com raid0-b
10.28.233.197 prod-29-internal.XXXer.com raid0-b
10.28.233.199 prod-30-internal.XXXer.com RAID0-B
10.28.233.201 prod-31-internal.XXXer.com RAID0-B
10.28.233.203 prod-32-internal.XXXer.com raid0-b
10.28.233.205 prod-33-internal.XXXer.com RAID0-B
10.28.233.207 prod-34-internal.XXXer.com Raid0-B
10.28.233.209 prod-35-internal.XXXer.com RAID0-B
10.28.60.236 prod-99-internal.XXXer.com
------------
/usr/lib/riak/erts-5.8.4/bin/erl -setcookie riak_XXXer_prod -hidden -name slf$$@127.0.0.1 -remsh riaksearch@10.28.60.212
Erlang R14B03 (erts-5.8.4) [source] [64-bit] [smp:24:24] [rq:24] [async-threads:0] [kernel-poll:false]
Eshell V5.8.4 (abort with ^G)
(riaksearch@10.28.60.212)1> {percept_profile:start("/tmp/slf.percept.prod-06.nopatch"), timer:sleep(5*1000), percept_profile:stop()}.
Starting profiling.
{{ok,#Port<0.76127726>},ok,ok}
(riaksearch@10.28.60.212)2> os:cmd("bzip2 -9v /tmp/slf.percept.prod-06.nopatch").
" /tmp/slf.percept.prod-06.nopatch: 11.485:1, 0.697 bits/byte, 91.29% saved, 69744612 in, 6072897 out.\n"
----------
* Convoy hunting
In slf.percept.prod-17.a.patch, at times:
14.5 - 15.1 seconds
delta [ms] time [ms] pid activity module:function/arity #runnables
0.0260 14505.0890 <0.398.0> inactive gen_server2:process_next_msg/9 3
0.1250 14505.2140 <0.11337.3385> inactive prim_inet:recv0/3 2
263.2790 14768.4930 <0.82.0> active gen_server:loop/6 3
0.0070 14768.5000 <0.91.0> active gen_server:loop/6 4
299.7850 15068.2850 <0.398.0> active gen_server2:process_next_msg/9 5
0.0570 15068.3420 <0.398.0> inactive gen_server2:process_next_msg/9 4
0.3950 15068.7370 <0.11995.3385> active prim_inet:recv0/3 5
0.0330 15068.7700 <0.82.0> inactive gen_server:loop/6 4
0.0800 15068.8500 <0.13398.3385> active prim_inet:accept0/2 5
0.0200 15068.8700 <0.13399.3385> active prim_inet:accept0/2 6
0.0140 15068.8840 <0.13403.3385> active prim_inet:accept0/2 7
.....
20.3 - 20.4 seconds
delta [ms] time [ms] pid activity module:function/arity #runnables
0.0290 20303.8780 <0.6542.3> active gen_fsm:loop/7 5
0.0080 20303.8860 <0.371.0> inactive gen_server:loop/6 4
41.4230 20345.3090 <0.13147.3384> inactive prim_file:drv_get_response/1 3
0.0530 20345.3620 <0.19601.3385> active gen_fsm:loop/7 4
0.0450 20345.4070 <0.19549.3385> active mochiweb_http:request/3 5
0.0050 20345.4120 <0.19553.3385> active mochiweb_http:request/3 6
20.48 - 20.9 seconds
delta [ms] time [ms] pid activity module:function/arity #runnables
0.0050 20493.0300 <0.19699.3385> inactive mochiweb_http:request/3 6
0.0070 20493.0370 <0.19811.3385> inactive erlang:apply/2 5
386.2050 20879.2420 <0.20033.7> inactive erlang:crc32/2 4
0.0070 20879.2490 <0.19444.3385> active erlang:process_info/2 5
0.0110 20879.2600 <0.398.0> active gen_server2:process_next_msg/9 6
0.0060 20879.2660 <0.82.0> active gen_server:loop/6 7
0.0050 20879.2710 <0.20033.7> active erlang:crc32/2 8
0.0030 20879.2740 <0.23401.3375> active prim_inet:close_pend_loop/2 9
0.0060 20879.2800 <0.8920.4> active gen_fsm:loop/7 10
25.90 - 26.7 seconds
delta [ms] time [ms] pid activity module:function/arity #runnables
0.0240 25991.9890 <0.29987.9> inactive gen_fsm:loop/7 3
0.0080 25991.9970 <0.398.0> inactive gen_server2:process_next_msg/9 2
493.1710 26485.1680 <0.91.0> active gen_server:loop/6 3
0.0230 26485.1910 <0.28651.3375> active mochiweb_http:request/3 4
0.0290 26485.2200 <0.92.0> active gen_event:fetch_msg/5 5
0.0080 26485.2280 <0.91.0> inactive gen_server:loop/6 4
In slf.percept.prod-17.b.patch, holy cow, there's bad behavior between
1 and about 8 seconds!!
Each "active" message included below is the first event after the big pause.
Also, pid <0.398.0> is riak_kv_stat.
1.29 - 2.1 seconds (775 msec)
<0.398.0> active gen_server2:process_next_msg/9
2.29 - 2.97 sec (667 msec)
<0.398.0> active gen_server2:process_next_msg/9
3.00 - 3.87 sec (859 msec)
<0.398.0> active gen_server2:process_next_msg/9
4.63 - 4.77 sec (131 msec)
<0.398.0> active gen_server2:process_next_msg/9
4.79 - 5.67 sec (861 msec)
<0.6542.3> inactive gen_server2:cast_msg/1
WEEEEEIIIRRRRDDD?????
5.69 - 6.6 sec (861 msec)
<0.398.0> active gen_server2:process_next_msg/9
7.14 - 7.46 sec (310 msec)
<0.398.0> active gen_server2:process_next_msg/9
7.67 - 8.36 sec (673 msec)
<0.9404.3489> active prim_file:drv_get_response/1
In slf.percept.prod-17.c.patch, holy cow, even worse: a single pause
of 3+ seconds!!
NOTE: during this time there was a bitcask merge. !@#$! I didn't save
the Riak log message. But it happened near the end of the 31 sec
sample window.
delta [ms] time [ms] pid activity module:function/arity #runnables
0.0080 25637.3680 <0.398.0> active gen_server2:process_next_msg/9 4
0.0050 25637.3730 <0.8920.4> inactive gen_fsm:loop/7 3
0.0100 25637.3830 <0.2472.11> inactive gen_fsm:loop/7 2
0.0070 25637.3900 <0.398.0> inactive gen_server2:process_next_msg/9 1
3187.6570 28825.0470 <0.24.0> active gen_server:loop/6 2
0.0130 28825.0600 <0.30277.3558> inactive gen:do_call/4 1
0.0210 28825.0810 <0.608.0> active gen_server:loop/6 2
0.0070 28825.0880 <0.82.0> active gen_server:loop/6 3
0.0090 28825.0970 <0.666.0> active mi_scheduler:worker_loop/1 4
0.0110 28825.1080 <0.6542.3> active gen_fsm:loop/7 5
0.0080 28825.1160 <0.20033.7> active gen_fsm:loop/7 6
0.0040 28825.1200 <0.665.0> active gen_server:loop/6 7
0.0100 28825.1300 <0.666.0> inactive mi_scheduler:worker_loop/1 6
0.0080 28825.1380 <0.23178.8> active gen_fsm:loop/7 7
0.0040 28825.1420 <0.20.0> active gen_server:loop/6 8
0.0100 28825.1520 <0.608.0> inactive gen:do_call/4 7
0.0230 28825.1750 <0.608.0> active gen:do_call/4 8
0.0040 28825.1790 <0.20.0> inactive gen_server:loop/6 7
0.0070 28825.1860 <0.91.0> active gen_server:loop/6 8
0.0060 28825.1920 <0.665.0> inactive gen_server:loop/6 7
0.0140 28825.2060 <0.23225.3559> active prim_inet:accept0/2 8
0.0060 28825.2120 <0.23228.3559> active prim_inet:accept0/2 9
0.0040 28825.2160 <0.23231.3559> active prim_inet:accept0/2 10
0.0050 28825.2210 <0.23233.3559> active prim_inet:accept0/2 11
0.0040 28825.2250 <0.82.0> inactive gen_server:loop/6 10
0.0020 28825.2270 <0.452.0> active gen_server:loop/6 11
0.0030 28825.2300 <0.24.0> inactive prim_file:drv_get_response/1 10
0.0040 28825.2340 <0.23235.3559> active prim_inet:accept0/2 11
0.0060 28825.2400 <0.23239.3559> active prim_inet:accept0/2 12
0.0060 28825.2460 <0.23244.3559> active prim_inet:accept0/2 13
0.0190 28825.2650 <0.23247.3559> active prim_inet:accept0/2 14
0.0080 28825.2730 <0.23249.3559> active prim_inet:accept0/2 15
0.0070 28825.2800 <0.23253.3559> active prim_inet:accept0/2 16
0.0110 28825.2910 <0.6542.3> inactive gen_fsm:loop/7 15
0.0140 28825.3050 <0.23254.3559> active prim_inet:accept0/2 16
0.0110 28825.3160 <0.23260.3559> active prim_inet:accept0/2 17
0.0090 28825.3250 <0.23259.3559> active prim_inet:accept0/2 18
0.0040 28825.3290 <0.23262.3559> active prim_inet:accept0/2 19
0.0080 28825.3370 <0.371.0> active gen_server:loop/6 20
---------------------------------------
basho@prod-17:~$ cat /home/basho/DO.sh
#!/bin/sh
TIME=$1
export PATH=/usr/lib/riak/erts-5.8.4/bin:$PATH
sh -c "while [ 1 ]; do echo RESTARTING ; date; /tmp/etop -setcookie riak_XXXer_prod -node riaksearch@10.28.60.232 -interval 1 -sort msg_q 2>&1 | egrep -v --line-buffered 'application_master:m| 0 gen_fsm:loop/7' | egrep --line-buffered 'Reds|proc_lib|riak_kv' ; done" &
ETOPPID=$!
sh -c "while [ 1 ]; do date ; sleep 1 ; done" &
DATEPID=$!
sh -c "tail -f /var/log/riak/console.log | egrep --line-buffered '@mi_|/ssd|monitor'" &
TAILPID=$!
sh -c ' while [ 1 ]; do ps 29259 | grep pts | grep -v Ssl; sleep 0.05; done' &
PSPID=$!
iostat -kx 1 $TIME
kill $ETOPPID
kill $DATEPID
kill $TAILPID
kill $PSPID
sleep 1
kill `ps axww | grep etop | grep -v grep | awk '{print $1}'`
---------------------------------------
basho@prod-17:~$ for i in /proc/sys/vm/*dirty*; do echo $i ; cat $i ; echo "" ; done
/proc/sys/vm/dirty_background_bytes
0
/proc/sys/vm/dirty_background_ratio
10
/proc/sys/vm/dirty_bytes
0
/proc/sys/vm/dirty_expire_centisecs
3000
/proc/sys/vm/dirty_ratio
20
/proc/sys/vm/dirty_writeback_centisecs
500
---------------------------------------
Dammit, I *knew* that I was seeing prod-17's disks act much, *much*
slower than prod-15's. ... There are 3 boxes that have RAID 5
settings, see the 'XXXer-raid.txt' file.
prod-15:
Logical device number 1
Logical device name : RAID0-B
RAID level : 0
Status of logical device : Optimal
Size : 762382 MB
Stripe-unit size : 256 KB
Read-cache mode : Enabled
MaxIQ preferred cache setting : Enabled
MaxIQ cache setting : Disabled
Write-cache mode : Enabled (write-back)
Write-cache setting : Enabled (write-back) when protected by battery/ZMM
Partitioned : Yes
Protected by Hot-Spare : No
Bootable : No
Failed stripes : No
Power settings : Disabled
prod-17:
Logical device number 1
Logical device name : raid-5-b
RAID level : 5
Status of logical device : Optimal
Size : 667090 MB
Stripe-unit size : 256 KB
Read-cache mode : Enabled
MaxIQ preferred cache setting : Disabled
MaxIQ cache setting : Disabled
Write-cache mode : Enabled (write-back)
Write-cache setting : Enabled (write-back) when protected by battery/ZMM
Partitioned : Yes
Protected by Hot-Spare : No
Bootable : No
Failed stripes : No
Power settings : Disabled
-----
tail -f /var/log/riak/console.log | egrep --line-buffered -v 'exit,normal|riak_search_backend:collect_info_response:55|riak_solr_searcher_wm:malformed_request:82|webmachine error: path|too_many_results,|to_mochijson2| monitor |Problem invoking pre-commit hook|riak_indexed_doc,index,|@mi_|riak_search_kv_hook,precommit|badmatch,.error,closed|error,function_clause.*base64' &
-----
slf.percept.prod-29.bigbacklog2.patch.bz2
16ms pause: 5.2053 - 5.2277 sec
16ms pause: 1.5708 - 1.5932 sec
9ms pause : 11.7316 - 11.7463 sec
: 16.2913 - 16.2957 sec
23ms pause: 24.0534 - 24.0796 sec
-----
Answer (?) to weird Erlang proc scheduling problem: busy_dist_port.
riak_sysmon was buggy and mis-configured! ARRGH!
Solution: add +zdbbl 16384 to all nodes, then do rolling restart.
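For reference, busy_dist_port events can also be watched straight from the Erlang
shell with the VM's system monitor. A minimal sketch only, not how riak_sysmon is
wired up; note that erlang:system_monitor/2 steals the single monitor slot away
from riak_sysmon, so only try this on a node you're willing to restart:
Watcher = spawn(fun() ->
            erlang:system_monitor(self(), [busy_dist_port]),
            F = fun(Loop) ->
                    receive
                        {monitor, Pid, busy_dist_port, Port} ->
                            io:format("~p blocked on dist port ~p~n", [Pid, Port])
                    end,
                    Loop(Loop)
                end,
            F(F)
          end).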
During the rolling restart, run 32 different basho_bench instances,
each aimed at a single Riak cluster member. Used two shell 1-liner
hacks to monitor the basho_bench data:
* Look for non-zero error rates for GET
* Look for bad max latency values for GET
while [ 1 ]; do date ; for i in tests.prod-*/current/get*csv; do echo `echo $i | sed 's/-internal.*//'` `tail -1 $i`; done | awk '$12 > 0 { print }' | sort -nr -k 11 | head -5 ; sleep 15; done
while [ 1 ]; do date ; for i in tests.prod-*/current/get*csv; do echo `echo $i | sed 's/-internal.*//'` `tail -1 $i`; done | sort -nr -k 10 | head -5 ; sleep 15; done
-----
* Return of the vnode mailbox backups from hell.
Example:
Mon Dec 19 21:18:29 PST 2011
'riaksearch@10.28.233.207': [{riak_kv_vnode,327571},{riak_pipe_vnode,0},{riak_search_vnode,0}]
'riaksearch@10.28.60.208': [{riak_kv_vnode,17893},{riak_pipe_vnode,0},{riak_search_vnode,0}]
'riaksearch@10.28.233.205': [{riak_kv_vnode,6},{riak_pipe_vnode,0},{riak_search_vnode,0}]
'riaksearch@10.28.60.232': [{riak_kv_vnode,2},{riak_pipe_vnode,0},{riak_search_vnode,0}]
'riaksearch@10.28.60.218': [{riak_kv_vnode,2},{riak_pipe_vnode,0},{riak_search_vnode,0}]
'riaksearch@10.28.60.228': [{riak_kv_vnode,1},{riak_pipe_vnode,0},{riak_search_vnode,0}]
'riaksearch@10.28.60.240': [{riak_kv_vnode,1},{riak_pipe_vnode,0},{riak_search_vnode,0}]
'riaksearch@10.28.60.220': [{riak_kv_vnode,1},{riak_pipe_vnode,0},{riak_search_vnode,0}]
'riaksearch@10.28.233.195': [{riak_kv_vnode,1},{riak_pipe_vnode,0},{riak_search_vnode,0}]
'riaksearch@10.28.233.219': [{riak_kv_vnode,0},{riak_search_vnode,0}]
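(The per-node summary above came from a polling script that isn't pasted in this
gist. A rough reconstruction, reusing the riak_core_vnode_sup + get_mod_index
calls from the funs earlier in these notes -- hedged sketch only, no badrpc
handling:)
VnodeMailboxes = fun(Node) ->
    Children = rpc:call(Node, supervisor, which_children, [riak_core_vnode_sup]),
    lists:foldl(fun(Pid, Acc) ->
                    case catch {riak_core_vnode:get_mod_index(Pid),
                                rpc:call(Node, erlang, process_info,
                                         [Pid, message_queue_len])} of
                        {{Mod, _Idx}, {message_queue_len, Q}} ->
                            orddict:update_counter(Mod, Q, Acc);
                        _ ->
                            Acc    % vnode exited mid-scan; skip it
                    end
                end, orddict:new(), [P || {_, P, _, _} <- Children, is_pid(P)])
end.
[{Nd, VnodeMailboxes(Nd)} || Nd <- lists:sort([node()|nodes()])].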
So, we appear to be having packet drop problems. If we measure using
"ping", we get one set of data: no packets dropped on the external
interfaces.
root@prod-04:~# sh -c 'FILE=/tmp/slf.capt`date +%s` ; echo Putting tcpdump output in $FILE ; tcpdump -i eth0 -c 200000 -w $FILE ; tcptrace -l $FILE | grep "rexmt data pkts" | head -5 ; echo Pseudo-count of TCP flows with retransmitted packets: ; tcptrace -l $FILE | grep "rexmt data pkts" | grep -v "rexmt data pkts: 0 rexmt data pkts: 0" | wc -l'
Putting tcpdump output in /tmp/slf.capt1324358762
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
200000 packets captured
200012 packets received by filter
0 packets dropped by kernel
rexmt data pkts: 0 rexmt data pkts: 0
rexmt data pkts: 0 rexmt data pkts: 0
rexmt data pkts: 0 rexmt data pkts: 0
rexmt data pkts: 0 rexmt data pkts: 14
rexmt data pkts: 0 rexmt data pkts: 16
Pseudo-count of TCP flows with retransmitted packets:
82
... and ...
basho@prod-04:~$ bash -c 'for t in "" -internal; do for i in prod-{0{4,5,6,7,8,9},1{0,1,2,3,4,5,7,8,9},2{0,1,2,3,4,5,6,7,8,9},3{0,1,2,3,4,5,6,7,8,9},40,99}$t.XXXer.com; do sudo ping -i 0.02 -c 50 $i 2>&1; done done' | awk '/PING/ {host = $2} /transmitted/ { print host, $0 }' | grep -v '50 received'
prod-05-internal.XXXer.com 50 packets transmitted, 0 received, 100% packet loss, time 1369ms
prod-06-internal.XXXer.com 50 packets transmitted, 49 received, 2% packet loss, time 995ms
prod-07-internal.XXXer.com 50 packets transmitted, 49 received, 2% packet loss, time 1005ms
prod-08-internal.XXXer.com 50 packets transmitted, 49 received, 2% packet loss, time 1157ms
prod-10-internal.XXXer.com 50 packets transmitted, 48 received, 4% packet loss, time 1114ms
prod-11-internal.XXXer.com 50 packets transmitted, 48 received, 4% packet loss, time 1030ms
prod-12-internal.XXXer.com 50 packets transmitted, 48 received, 4% packet loss, time 1009ms
prod-13-internal.XXXer.com 50 packets transmitted, 49 received, 2% packet loss, time 1006ms
prod-14-internal.XXXer.com 50 packets transmitted, 49 received, 2% packet loss, time 993ms
prod-17-internal.XXXer.com 50 packets transmitted, 49 received, 2% packet loss, time 1018ms
prod-20-internal.XXXer.com 50 packets transmitted, 49 received, 2% packet loss, time 1001ms
prod-21-internal.XXXer.com 50 packets transmitted, 49 received, 2% packet loss, time 1004ms
prod-24-internal.XXXer.com 50 packets transmitted, 48 received, 4% packet loss, time 1019ms
prod-26-internal.XXXer.com 50 packets transmitted, 49 received, 2% packet loss, time 1021ms
prod-28-internal.XXXer.com 50 packets transmitted, 49 received, 2% packet loss, time 1025ms
prod-30-internal.XXXer.com 50 packets transmitted, 47 received, 6% packet loss, time 1082ms
prod-32-internal.XXXer.com 50 packets transmitted, 47 received, 6% packet loss, time 1012ms
----
WRONG WRONG WRONG here for doc purposes only!
root@prod-04:~# for i in `seq 1 5`; do sh -c 'ifconfig eth0 ; sleep 3 ; ifconfig eth0' | grep bytes | sed -e 's/.. bytes://g' | awk 'NR == 1 { rx = $1; tx = $4} NR == 2 { printf "rx %.1f Mbit/sec tx %1.f Mbit/sec WRONG WRONG WRONG\n", ($1 - rx) / (1024*1024) / 3, ($4 - tx) / (1024*1024) / 3 }' ; done
rx 71.5 Mbit/sec tx 104 Mbit/sec
rx 75.5 Mbit/sec tx 113 Mbit/sec
rx 64.5 Mbit/sec tx 44 Mbit/sec
rx 56.2 Mbit/sec tx 94 Mbit/sec
rx 79.4 Mbit/sec tx 108 Mbit/sec
2011-12-20 update: WWWHHHAAAAAA! That f'ing awk script is *wrong*.
It's not converting bytes to bits!!!!!
Here's a correct one:
root@prod-04:~# date ; for i in `seq 1 5`; do sh -c 'ifconfig eth0 ; sleep 3 ; ifconfig eth0' | grep bytes | sed -e 's/.. bytes://g' | awk 'NR == 1 { rx = $1; tx = $4} NR == 2 { printf "rx %.1f Mbit/sec tx %1.f Mbit/sec\n", ($1 - rx) * 8 / (1024*1024) / 3, ($4 - tx) * 8 / (1024*1024) / 3 }' ; done
Tue Dec 20 15:18:49 PST 2011
rx 423.0 Mbit/sec tx 488 Mbit/sec
rx 491.6 Mbit/sec tx 566 Mbit/sec
rx 483.2 Mbit/sec tx 519 Mbit/sec
rx 437.1 Mbit/sec tx 495 Mbit/sec
rx 456.5 Mbit/sec tx 489 Mbit/sec
---
root@prod-04:~# grep -i qu /tmp/slf.foo
105815089 requests sent out
12900 reassemblies required
source quenches: 106
echo requests: 58153
echo request: 95724
416815 packets pruned from receive queue because of socket buffer overrun
2222 packets pruned from receive queue
1 packets dropped from out-of-order queue because of socket buffer overrun
Quick ack mode was activated 9682843 times
413634 times the listen queue of a socket overflowed
600138 packets directly queued to recvmsg prequeue.
1780022 bytes directly received in process context from prequeue
167 packets header predicted and directly queued to user
4475848 packets collapsed in receive queue due to low socket buffer
watch -d "sh -c 'netstat -s | grep -i queue'"
watch -d "sh -c 'netstat -s | egrep -i retr'"
The "netstat -s" method confirms that there are several hundred
packets being retransmitted by TCP each second, on average.
--
Wed 12-21 early morning:
Set MTU on all eth0 interfaces from 1500 -> 9000.
Then disconnect and reconnect all Erlang nodes.
for i in `awk '{print $1}' /tmp/slf.some-hosts `; do /usr/lib/riak/erts-5.8.4/bin/erl -name slf$$@127.0.0.1 -hidden -setcookie riak_XXXer_prod -remsh riaksearch@$i ; done
length(nodes()).
WARNING: Search for "WOW, that caused some unintentional chaos monkey" string below!!!
lists:usort([{net_kernel:disconnect(Nd), timer:sleep(20), net_kernel:connect(Nd), net_adm:ping(Nd)} || Nd <- nodes()]).
length(nodes()).
---
Wed 12-21:
Change CPU freq governor from 'ondemand' -> 'performance'
See set*expect script for how-to
Experimentation on prod-23 and prod-21
13 ethtool --show-pause eth0
14 ethtool --pause autoneg off rx off tx off
15 ethtool --pause eth0 autoneg off rx off tx off
16 sudo ethtool --pause eth0 autoneg off rx off tx off
17 ethtool --show-pause eth0
19 ethtool --show-ring eth0
20 xsudo ethtool --set-ring eth0 rx 4096 tx 4096 rx-mini 256 rx-jumbo 256
21 sudo ethtool --set-ring eth0 rx 4096 tx 4096 rx-mini 256 rx-jumbo 256
22 sudo ethtool --set-ring eth0 rx 4096 tx 4096
24 ethtool --show-ring eth0
Nothing really seems to help.
Except that Matt Ranney & his crew changed the way the timeline
history is managed, shrinking the single key's max size for appending
history info (originally 60KB?) by half(?). That made a roughly
100 Mbit/sec average decrease on all machines.
---
Thu 12-22
Matt shrank the timeline history max size again. Nice win.
Also, after:
"Matt R. Man, it feels really good to be finally making some
progress. All it took was spending an entire day yelling at
SoftLayer."
... we learned that SL's explanation is that despite average
utilization @ 300Mbit/sec, there are microburst periods where other
cluster members coincidentally burst traffic to a single receiver
above line rate and beyond the switch's very limited capacity to
buffer. So, during those very small time periods, yes, packets are
dropped. And yes, the 1-sec average utilization rate seems quite low
and reasonable.
That theory seems to be making some sense. See the escript file in
the data dir for polling the /sys/path/to/eth0/rx_bytes counter.
Either there's a timing error (the kernel's timer resolution is 100Hz,
so the smallest timer interval Erlang can fire at is 10 milliseconds),
or it's true that these micro-bursts are happening on final egress out
of the switching fabric. Bummer.
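(The escript itself isn't in this gist. A rough sketch of the idea -- hedged:
/sys/class/net/eth0/statistics/rx_bytes is my guess at the counter path, and the
10 ms sleep is the floor imposed by the 100Hz tick mentioned above:)
#!/usr/bin/env escript
%% Sketch only: poll the NIC's received-byte counter and print the average
%% rate over each ~10 ms wakeup, to look for sub-second bursts.
main(_) ->
    loop(rx_bytes(), now_usec()).

loop(PrevBytes, PrevT) ->
    timer:sleep(10),
    Bytes = rx_bytes(),
    T = now_usec(),
    MBits = (Bytes - PrevBytes) * 8 / ((T - PrevT) / 1000000) / 1000000,
    io:format("~.1f Mbit/sec over ~p usec~n", [MBits, T - PrevT]),
    loop(Bytes, T).

rx_bytes() ->
    {ok, Bin} = file:read_file("/sys/class/net/eth0/statistics/rx_bytes"),
    list_to_integer(string:strip(binary_to_list(Bin), right, $\n)).

now_usec() ->
    {Mega, Sec, Micro} = now(),
    (Mega * 1000000 + Sec) * 1000000 + Micro.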
---
ets:match() elimination
-----------------------
Data files in /Users/fritchie/b/notes/2011-12-07.XXXer.data/ets-match-stuff/*
prod-2000 has pthreads locking (not the default custom locks) and no
-fomit-frame-pointer (for decent stack traces).
[root@prod-2000 ~]# awk '/PID/ { c = 0} /beam.smp/ && c < 5 { usr += $3; sys += $4; lck += $8; slp += $9} END { print "usr, sys, lck, slp" ; print usr, sys, lck, slp; }' /root/prstat.before-slf
usr, sys, lck, slp
7692.6 825.5 57352 5297.3
[root@prod-2000 ~]# awk '/PID/ { c = 0} /beam.smp/ && c < 5 { usr += $3; sys += $4; lck += $8; slp += $9} END { print "usr, sys, lck, slp" ; print usr, sys, lck, slp; }' /root/prstat.after-slf-draft3
usr, sys, lck, slp
5073.4 553.2 58133.1 4885
Makes a pretty big difference for USR time.
prod-2009 is stock riak 1.0.2 + eleveldb
[root@prod-2009 ~]# awk '/PID/ { c = 0} /beam.smp/ && c < 5 { usr += $3; sys += $4; lck += $8; slp += $9} END { print "usr, sys, lck, slp" ; print usr, sys, lck, slp; }' /root/prstat.before-slf
usr, sys, lck, slp
1346.8 650.6 59578 5631.3
[root@prod-2009 ~]# awk '/PID/ { c = 0} /beam.smp/ && c < 5 { usr += $3; sys += $4; lck += $8; slp += $9} END { print "usr, sys, lck, slp" ; print usr, sys, lck, slp; }' /root/prstat.after-slf-draft3
usr, sys, lck, slp
1269.4 584.8 56486 5512.3
--------
Crazy experimentation with plockstat.
I've managed to use a hacked version of plockstat for up to 250
milliseconds at a time. (Locks are really, really hot and generate
many thousands of DTrace events in that 1/4 second.) The differences
between 1/4-second samples are striking: the most highly contended locks
fall into several different areas. More sampling (more than 6 or 7 runs)
is required, plus some automated script hacks, to figure out where the
truly hot areas are.
[root@prod-2009 ~]# for i in a b c d e f; do dtrace -x dynvarsize=4m -s /root/plockstat.d -p 28476 > /tmp/foo250.$i ; sleep 2 ; done
---
my.joyentcloud.com
devops@XXXer.com
XXXerJoyentPasswordHere
From roughly Dec 23 00:00 through Dec 27 17:00 ... some polling
intervals are every 30 seconds, some are every 5 seconds, and
sometimes polling was not running at all.
awk '/ PST 2011$/ {c = 0 } /riak_kv_vnode/ { c++ } /riak_kv_vnode/ && c == 1 {print}' ~/XXXer.txt | sed 's/[{},]/ /g' | awk '{ print $4}' | sed -e 's/.$//' -e 's/.$//' -e 's/.$//' | sort -nr | uniq -c
instances vnode mailbox size x 1000
--------- -------------------------
1 258-259
1 70-70
1 54-55
1 28-29
5 5-6
17 4-5
80 3-4
424 2-3
1233 1-2
26882 0-1
---
So, using the rto_min setting has some kind of units conversion
problem. If you actually specify "200ms", you end up getting 120
*seconds* of timeout. "1ms" seems to do about the same thing.
Using "1ns" or "10ns" or perhaps "100ns" seems fine: there's never an
"iperf" run using less bandwidth than 250 Mbit/sec.
root@prod-04:~# for size in 1ns 10ns 1ns 100ns 1ns 1us 1ns 10us 1ns; do echo "HEY, changing to $size"; ip route change to 10.28.60.192/26 dev eth0 rto_min $size; ip route show to 10.28.60.192/26 ; for i in `seq 1 50`; do iperf -c prod-18-internal.XXXer.com -n 15M 2>&1 | grep sec ; sleep 0.1; done; done
HEY, changing to 1ns
10.28.60.192/26 dev eth0 scope link rto_min lock 0.00ms
[ 3] 0.0- 0.2 sec 15.0 MBytes 514 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 592 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 665 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 594 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 546 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 657 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 493 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 393 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 585 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 576 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 616 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 559 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 584 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 566 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 523 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 434 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 553 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 443 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 658 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 703 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 606 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 576 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 606 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 626 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 526 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 570 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 583 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 515 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 625 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 676 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 511 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 631 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 586 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 564 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 550 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 658 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 575 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 487 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 655 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 563 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 574 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 637 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 589 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 572 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 576 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 573 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 603 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 660 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 618 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 526 Mbits/sec
HEY, changing to 10ns
10.28.60.192/26 dev eth0 scope link rto_min lock 0.00ms
[ 3] 0.0- 0.2 sec 15.0 MBytes 546 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 557 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 598 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 584 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 543 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 642 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 582 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 657 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 553 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 634 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 646 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 624 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 608 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 489 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 556 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 538 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 520 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 605 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 523 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 576 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 561 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 543 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 402 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 555 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 619 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 610 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 587 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 638 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 610 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 605 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 588 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 534 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 617 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 537 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 601 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 672 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 658 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 624 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 635 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 629 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 631 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 604 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 656 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 735 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 423 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 443 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 553 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 573 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 652 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 544 Mbits/sec
HEY, changing to 1ns
10.28.60.192/26 dev eth0 scope link rto_min lock 0.00ms
[ 3] 0.0- 0.2 sec 15.0 MBytes 629 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 457 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 608 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 508 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 760 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 574 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 591 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 555 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 622 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 453 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 398 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 647 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 776 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 621 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 570 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 622 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 546 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 614 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 500 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 525 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 569 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 570 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 628 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 646 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 461 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 601 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 749 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 551 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 442 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 474 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 564 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 650 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 608 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 529 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 529 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 642 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 547 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 512 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 616 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 730 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 678 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 574 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 495 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 524 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 531 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 565 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 611 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 625 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 630 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 560 Mbits/sec
HEY, changing to 100ns
10.28.60.192/26 dev eth0 scope link rto_min lock 0.00ms
[ 3] 0.0- 0.3 sec 15.0 MBytes 470 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 587 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 640 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 670 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 521 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 518 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 632 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 543 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 553 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 579 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 637 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 673 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 503 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 502 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 575 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 619 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 638 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 608 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 592 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 610 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 509 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 521 Mbits/sec
[ 3] 0.0- 0.4 sec 15.0 MBytes 356 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 659 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 647 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 732 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 648 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 624 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 587 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 646 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 606 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 566 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 507 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 498 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 486 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 697 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 767 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 476 Mbits/sec
[ 3] 0.0- 0.1 sec 15.0 MBytes 863 Mbits/sec
[ 3] 0.0- 0.1 sec 15.0 MBytes 849 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 752 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 783 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 755 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 749 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 793 Mbits/sec
[ 3] 0.0- 0.1 sec 15.0 MBytes 872 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 771 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 743 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 788 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 725 Mbits/sec
HEY, changing to 1ns
10.28.60.192/26 dev eth0 scope link rto_min lock 0.00ms
[ 3] 0.0- 0.2 sec 15.0 MBytes 801 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 775 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 795 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 750 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 779 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 809 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 606 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 532 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 380 Mbits/sec
[ 3] 0.0- 0.5 sec 15.0 MBytes 273 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 366 Mbits/sec
[ 3] 0.0- 0.4 sec 15.0 MBytes 349 Mbits/sec
[ 3] 0.0- 0.4 sec 15.0 MBytes 288 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 501 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 648 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 719 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 505 Mbits/sec
[ 3] 0.0- 0.4 sec 15.0 MBytes 298 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 573 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 397 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 528 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 598 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 699 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 726 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 579 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 585 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 614 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 661 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 558 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 588 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 584 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 586 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 400 Mbits/sec
[ 3] 0.0- 0.4 sec 15.0 MBytes 349 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 512 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 521 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 652 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 668 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 618 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 615 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 704 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 590 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 481 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 544 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 502 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 586 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 686 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 656 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 462 Mbits/sec
[ 3] 0.0- 0.4 sec 15.0 MBytes 341 Mbits/sec
HEY, changing to 1us
10.28.60.192/26 dev eth0 scope link rto_min lock 0.00ms
[ 3] 0.0- 0.3 sec 15.0 MBytes 370 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 536 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 633 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 516 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 531 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 458 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 542 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 608 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 711 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 575 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 637 Mbits/sec
[ 3] 0.0- 1.2 sec 15.0 MBytes 103 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 618 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 713 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 629 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 705 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 650 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 572 Mbits/sec
[ 3] 0.0- 1.3 sec 15.0 MBytes 95.5 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 553 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 677 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 662 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 611 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 596 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 581 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 618 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 625 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 568 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 641 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 581 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 607 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 600 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 626 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 581 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 671 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 524 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 480 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 560 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 603 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 686 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 618 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 553 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 536 Mbits/sec
[ 3] 0.0- 1.4 sec 15.0 MBytes 92.2 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 523 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 525 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 522 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 487 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 620 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 419 Mbits/sec
HEY, changing to 1ns
10.28.60.192/26 dev eth0 scope link rto_min lock 0.00ms
[ 3] 0.0- 0.2 sec 15.0 MBytes 573 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 574 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 618 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 543 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 640 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 552 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 375 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 488 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 458 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 486 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 583 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 645 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 624 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 579 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 566 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 542 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 581 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 626 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 685 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 647 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 722 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 452 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 553 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 507 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 497 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 666 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 730 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 441 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 555 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 755 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 609 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 525 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 651 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 719 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 636 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 518 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 641 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 671 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 595 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 622 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 626 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 579 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 377 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 366 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 378 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 434 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 697 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 636 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 694 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 702 Mbits/sec
HEY, changing to 10us
10.28.60.192/26 dev eth0 scope link rto_min lock 0.01ms
[ 3] 0.0- 0.2 sec 15.0 MBytes 685 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 614 Mbits/sec
[ 3] 0.0-10.3 sec 15.0 MBytes 12.2 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 666 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 472 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 528 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 654 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 457 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 644 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 620 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 574 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 494 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 569 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 672 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 613 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 673 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 528 Mbits/sec
[ 3] 0.0-10.2 sec 15.0 MBytes 12.3 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 589 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 585 Mbits/sec
[ 3] 0.0-10.2 sec 15.0 MBytes 12.3 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 677 Mbits/sec
[ 3] 0.0-10.2 sec 15.0 MBytes 12.3 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 694 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 505 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 502 Mbits/sec
[ 3] 0.0-10.2 sec 15.0 MBytes 12.3 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 704 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 444 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 444 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 526 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 583 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 686 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 727 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 793 Mbits/sec
[ 3] 0.0- 0.1 sec 15.0 MBytes 971 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 639 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 483 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 516 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 576 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 620 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 560 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 574 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 570 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 552 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 635 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 576 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 655 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 564 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 730 Mbits/sec
HEY, changing to 1ns
10.28.60.192/26 dev eth0 scope link rto_min lock 0.00ms
[ 3] 0.0- 0.2 sec 15.0 MBytes 834 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 759 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 579 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 565 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 458 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 388 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 497 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 454 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 509 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 415 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 644 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 512 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 608 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 627 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 594 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 518 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 686 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 485 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 539 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 606 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 646 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 680 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 638 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 547 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 634 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 601 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 504 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 413 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 522 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 796 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 796 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 385 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 412 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 388 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 594 Mbits/sec
[ 3] 0.0- 0.3 sec 15.0 MBytes 423 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 688 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 572 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 519 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 652 Mbits/sec
[ 3] 0.0- 0.4 sec 15.0 MBytes 317 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 606 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 538 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 550 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 529 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 545 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 664 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 687 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 737 Mbits/sec
[ 3] 0.0- 0.2 sec 15.0 MBytes 582 Mbits/sec
----
Changed the RTO_min value for all SL cluster members on
Wed Dec 28 18:47:43 PST 2011.
I screwed up prod-09, which now has a 2x1GigE bonded interface.
Adding explicit routes on that machine to one of the member interfaces
was not a good idea. 99th% latencies shot pretty high for a little
while there, drat. Otherwise, things went smoothly. prod-09 was the
only machine that I screwed up, so no Riak operations failed.
Starting Wed Dec 28 21:06:12 PST 2011 - Wed Dec 28 21:14:44 PST 2011,
I used the net_kernel:disconnect() & reconnect trick (listed above) to
reset TCP connections between all Erlang nodes. I was still seeing
temporary (1 or 2 minute) see-saws in the 4-hour Cacti stats windows,
which suggests that perhaps the old RTO_min value was still being used??
We'll see if more of those dips appear now or not -- it may be too
late in the day to see the necessary traffic pattern to trigger them,
so wait until tomorrow?
WOW, that caused some unintentional chaos monkey. Apparently my
innocent Riak internal networking disconnect & reconnect caused very
visible problems. Among other things, all node.js processes in
Seattle restarted: a bug was found in an error handler in XXXer's
Riak client that raised another exception (and then cascaded?).
So, if we ever do that again, we need to be much more careful.
----
1st week January 2012:
Libumem tuning, leak analysis, fragmentation analysis
See Campfire transcripts and Greg Burd for more details.
----
Tue Jan 10 2012:
leveldb on prod-2024: weird slow behavior for one vnode.
[lots of bogus D script and output omitted here.]
AH! The D script was incorrect! I mixed up self-> and this->, doh!
On a normal prod-2010:
[root@prod-2010 ~]# for i in 1 2 3 4 ; do date ; dtrace -n 'BEGIN { total = 0; mystart = timestamp; } pid$target::eleveldb_write:entry {self->start = timestamp;} pid$target::eleveldb_write:return /self->start > 0/ { this->nsec = timestamp - self->start ; @ = quantize((this->nsec)/1000000) ; self->start = 0; total = total + this->nsec; this->nsec = 0} tick-5sec { printa(@); printf("Total milliseconds in func: %d out of %d\n", total/1000000, (timestamp - mystart) / 1000000); exit(0);}' -p 2159 ; done
fixit for write
January 11, 2012 09:26:48 PM UTC
dtrace: description 'BEGIN ' matched 4 probes
CPU ID FUNCTION:NAME
15 70345 :tick-5sec
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 677
1 | 2
2 | 0
Total milliseconds in func: 37 out of 4877
[...]
[root@prod-2010 ~]# echo fixit ; for i in 1 2 3 4 ; do date ; dtrace -n 'BEGIN { total = 0; mystart = timestamp; } pid$target::eleveldb_get:entry {self->start = timestamp;} pid$target::eleveldb_get:return /self->start > 0/ { this->nsec = timestamp - self->start ; @ = quantize((this->nsec)/1000000) ; self->start = 0; total = total + this->nsec; this->nsec = 0} tick-5sec { printa(@); printf("Total milliseconds in func: %d out of %d\n", total/1000000, (timestamp - mystart) / 1000000); exit(0);}' -p 2159 ; done
January 11, 2012 09:28:22 PM UTC
dtrace: description 'BEGIN ' matched 4 probes
CPU ID FUNCTION:NAME
5 70345 :tick-5sec
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2185
1 |@ 78
2 | 12
4 | 11
8 | 28
16 | 15
32 | 4
64 | 0
128 | 0
256 | 2
512 | 0
Total milliseconds in func: 2204 out of 4876
[...]
----
OK, now, let's count function calls during an eleveldb_get func call:
Again, on prod-2010, our "normal" node.
[root@prod-2010 ~]# for i in 1 2 3 4; do date ; dtrace -n 'BEGIN { total = 0; mystart = timestamp; } pid$target::eleveldb_get:entry {self->active = 1; total++;} pid$target::eleveldb_get:return {self->active = 0;} syscall:::entry /pid == $target && self->active == 1/ { @[probefunc] = count(); } syscall::write*:entry /pid == $target && self->active == 1/ { @w[ustack()] = count(); } tick-5sec { printa(@); printf("Total calls to func: %d\n", total); printa(@w); exit(0);}' -p 2159 ; done
January 11, 2012 09:31:52 PM UTC
dtrace: description 'BEGIN ' matched 239 probes
CPU ID FUNCTION:NAME
14 70345 :tick-5sec
close 551
open 552
lwp_park 1224
pread 2222
Total calls to func: 2973
[...]
OK, here's what I do for prod-2024:
* Put riak_kv service up
* Wait for msgq on bad vnode to get to 1,600-2,000 messages.
It takes about 60 seconds (!) for 1,600 messages to drain.
* Put riak_kv service down
* Wait a sec or two
* Start this dtrace script:
[root@prod-2024 ~]# for i in 1 2 3 4; do date ; dtrace -n 'BEGIN { total = 0; mystart = timestamp; } pid$target::eleveldb_get:entry {self->active = 1; total++;} pid$target::eleveldb_get:return {self->active = 0;} syscall:::entry /pid == $target && self->active == 1/ { @[probefunc] = count(); } syscall::write*:entry /pid == $target && self->active == 1/ { @w[ustack()] = count(); } tick-5sec { printa(@); printf("Total calls to func: %d\n", total); printa(@w); exit(0);}' -p 7610 ; done
January 11, 2012 09:51:34 PM UTC
dtrace: description 'BEGIN ' matched 239 probes
CPU ID FUNCTION:NAME
7 69397 :tick-5sec
mmap 1
lwp_park 20
close 1655
open 1669
pread 8971
Total calls to func: 14332
January 11, 2012 09:51:41 PM UTC
dtrace: description 'BEGIN ' matched 239 probes
CPU ID FUNCTION:NAME
15 69397 :tick-5sec
lwp_park 73
close 529
open 536
pread 2696
Total calls to func: 5408
[...]
WEIRD. There are at *most* 2K vnode operations in the vnode's
mailbox. Yet the eleveldb_get function is being called many *many*
more times than that.
HRM, WEIRDER, that many-thousand-calls-per-5-sec rate happens only occasionally.
Bimodal behavior. And it's less common than the other mode, which is only a
few hundred calls per 5 sec.
So ... if we use prstat during the mailbox draining of vnode ops, we
see this: 98-99% system time for one LWP.
Total: 51 processes, 777 lwps, load averages: 1.10, 1.10, 0.96
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
7610 riak 1.5 98 0.0 0.0 0.0 0.0 0.0 0.0 0 13 208 0 beam.smp/582
So, let's see if we can create a D script that will look only at that
LWP. Viz. DTrace book page 80.
BUMMER, the sched provider isn't available inside a zone.
SO, it's possible to use tid == LWP# as a guard to look at only a
single thread. Hooray. Using "prstat -mLc" as shown above shows a
single LWP at 98-99% system time. That LWP changes periodically: the
Erlang VM's scheduler migrates the work from time to time, and it is
prone to change when the process is having DTrace probes enabled.
So, I'm eyeballing "prstat -mLc 1" output to catch which LWP is the
98% system time LWP, running probes guarded by that tid, and then
continuing to watch "prstat" output to make sure the same
system-time-eating LWP remains hot through the entire dtrace snapshot
of 2-4 seconds. Those separate outputs are below ... I'll try to
combine them using awk to see if there's anything interesting when
all of the runs are aggregated.
---
^C[root@prod-2024 ~]# dtrace -n 'profile-997 /execname == "beam.smp" && arg1 && tid == 581/ { @[ufunc(arg1)] = count(); } tick-2s {printa(@); exit(0);}'
dtrace: description 'profile-997 ' matched 2 probes
CPU ID FUNCTION:NAME
1 69398 :tick-2s
beam.smp`mbc_realloc 1
beam.smp`enc_atom 1
beam.smp`erts_dsig_send_msg 1
beam.smp`small_times 1
eleveldb.so`_ZN7leveldb6DBImpl5WriteERKNS_12WriteOptionsEPNS_10WriteBatchE 1
eleveldb.so`_ZN7leveldb8FindFileERKNS_21InternalKeyComparatorERKSt6vectorIPNS_12FileMetaDataESaIS5_EERKNS_5SliceE 1
eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIteratorD1Ev 1
eleveldb.so`_ZN7leveldb12_GLOBAL__N_115ShardedLRUCache6LookupERKNS_5SliceE 1
eleveldb.so`_ZN7leveldb6crc32c6ExtendEjPKcm 1
eleveldb.so`_ZN7leveldb5Block11NewIteratorEPKNS_10ComparatorE 1
eleveldb.so`_ZN7leveldb5Block4IterD0Ev 1
libumem.so.1`umem_cache_alloc 1
libc.so.1`memcmp 1
libc.so.1`memcpy 1
libc.so.1`localtime_r 1
libc.so.1`clear_lockbyte 1
libc.so.1`mutex_unlock_queue 1
libc.so.1`mutex_lock_impl 2
beam.smp`process_main 3
[root@prod-2024 ~]# dtrace -n 'profile-997 /execname == "beam.smp" && arg1 && tid == 582/ { @[ufunc(arg1)] = count(); } tick-2s {printa(@); exit(0);}'
dtrace: description 'profile-997 ' matched 2 probes
CPU ID FUNCTION:NAME
7 69398 :tick-2s
beam.smp`buf_to_intlist 1
beam.smp`io_list_to_buf 1
beam.smp`encode_size_struct2 1
beam.smp`binary_to_term_1 1
beam.smp`tag_val_def 1
eleveldb.so`_ZN7leveldb6DBImpl19NewInternalIteratorERKNS_11ReadOptionsEPm 1
eleveldb.so`_ZNK7leveldb8SkipListIPKcNS_8MemTable13KeyComparatorEE18FindGreaterOrEqualERKS2_PPNS5_4NodeE 1
eleveldb.so`_ZN7leveldb10TableCache11NewIteratorERKNS_11ReadOptionsEmmPPNS_5TableE 1
eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIteratorD1Ev 1
eleveldb.so`_ZN7leveldb12_GLOBAL__N_115ShardedLRUCache6LookupERKNS_5SliceE 1
eleveldb.so`_ZN7leveldb6crc32c6ExtendEjPKcm 1
eleveldb.so`_ZN7leveldb5Block11NewIteratorEPKNS_10ComparatorE 1
eleveldb.so`_ZN7leveldb9ReadBlockEPNS_16RandomAccessFileERKNS_11ReadOptionsERKNS_11BlockHandleEPPNS_5BlockE 1
libumem.so.1`umem_free 1
libumem.so.1`vmem_xfree 1
libc.so.1`strcmp 1
libc.so.1`clear_lockbyte 1
libc.so.1`mutex_lock_impl 1
libc.so.1`mutex_unlock 1
eleveldb.so`_ZNK7leveldb5Table11NewIteratorERKNS_11ReadOptionsE 2
libc.so.1`memcpy 2
beam.smp`process_main 4
eleveldb.so`_ZN7leveldb8FindFileERKNS_21InternalKeyComparatorERKSt6vectorIPNS_12FileMetaDataESaIS5_EERKNS_5SliceE 4
[root@prod-2024 ~]# dtrace -n 'profile-997 /execname == "beam.smp" && arg1 && tid == 579/ { @[ufunc(arg1)] = count(); } tick-2s {printa(@); exit(0);}'
dtrace: description 'profile-997 ' matched 2 probes
CPU ID FUNCTION:NAME
2 69398 :tick-2s
beam.smp`mbc_alloc 1
beam.smp`remote_send 1
beam.smp`enc_term 1
beam.smp`dsig_send 1
beam.smp`big_norm 1
beam.smp`erts_bs_private_append 1
beam.smp`erts_bs_append 1
beam.smp`tag_val_def 1
beam.smp`ethr_atomic32_inc 1
eleveldb.so`_ZN7leveldb10TableCache11NewIteratorERKNS_11ReadOptionsEmmPPNS_5TableE 1
eleveldb.so`_ZN7leveldb12_GLOBAL__N_115MergingIteratorD0Ev 1
eleveldb.so`_ZN7leveldb12_GLOBAL__N_115ShardedLRUCache6LookupERKNS_5SliceE 1
eleveldb.so`_ZN7leveldb5Block4IterD0Ev 1
eleveldb.so`_ZN7leveldb9ReadBlockEPNS_16RandomAccessFileERKNS_11ReadOptionsERKNS_11BlockHandleEPPNS_5BlockE 1
libc.so.1`mutex_trylock_adaptive 1
libc.so.1`mutex_unlock_queue 1
beam.smp`encode_size_struct2 2
beam.smp`erts_get_aligned_binary_bytes_extra 2
eleveldb.so`_ZN7leveldbL22GetLengthPrefixedSliceEPKc 2
eleveldb.so`_ZN7leveldb8FindFileERKNS_21InternalKeyComparatorERKSt6vectorIPNS_12FileMetaDataESaIS5_EERKNS_5SliceE 2
eleveldb.so`_ZN7leveldb6crc32c6ExtendEjPKcm 2
libc.so.1`pread 2
libc.so.1`memcpy 3
beam.smp`process_main 4
[root@prod-2024 ~]# dtrace -n 'profile-997 /execname == "beam.smp" && arg1 && tid == 581/ { @[ufunc(arg1)] = count(); } tick-4s {printa(@); exit(0);}'
dtrace: description 'profile-997 ' matched 2 probes
CPU ID FUNCTION:NAME
9 69403 :tick-4s
beam.smp`tree_delete 1
beam.smp`bf_unlink_free_block 1
beam.smp`get_free_block 1
beam.smp`and_2 1
beam.smp`eq 1
beam.smp`cmp 1
beam.smp`atom_to_list_1 1
beam.smp`encode_size_struct2 1
beam.smp`dsig_send 1
beam.smp`new_binary 1
beam.smp`small_to_big 1
beam.smp`small_times 1
beam.smp`erts_whereis_name 1
beam.smp`enif_get_list_cell 1
libstdc++.so.6.0.14`_ZdlPv 1
eleveldb.so`_ZN7leveldb12_GLOBAL__N_16DBIter4SeekERKNS_5SliceE 1
eleveldb.so`_ZNK7leveldb8SkipListIPKcNS_8MemTable13KeyComparatorEE14KeyIsAfterNodeERKS2_PNS5_4NodeE 1
eleveldb.so`_ZNK7leveldb8SkipListIPKcNS_8MemTable13KeyComparatorEE18FindGreaterOrEqualERKS2_PPNS5_4NodeE 1
eleveldb.so`_ZN7leveldb12_GLOBAL__N_115MergingIterator4SeekERKNS_5SliceE 1
eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator4SeekERKNS_5SliceE 1
eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIteratorD1Ev 1
eleveldb.so`_ZN7leveldb12_GLOBAL__N_115ShardedLRUCache7ReleaseEPNS_5Cache6HandleE 1
eleveldb.so`_ZN7leveldb5Block11NewIteratorEPKNS_10ComparatorE 1
eleveldb.so`_ZN7leveldb5Block4IterD0Ev 1
eleveldb.so`_ZNK7leveldb12_GLOBAL__N_121PosixRandomAccessFile4ReadEmmPNS_5SliceEPc 1
libumem.so.1`malloc 1
libumem.so.1`umem_depot_alloc 1
libumem.so.1`vmem_getseg 1
libumem.so.1`vmem_alloc 1
libc.so.1`memcpy 1
libc.so.1`___errno 1
libc.so.1`mutex_lock_impl 1
libc.so.1`cond_broadcast 1
libc.so.1`pthread_self 1
libumem.so.1`umem_cache_alloc 2
libc.so.1`memcmp 2
eleveldb.so`_ZN7leveldb8FindFileERKNS_21InternalKeyComparatorERKSt6vectorIPNS_12FileMetaDataESaIS5_EERKNS_5SliceE 3
eleveldb.so`_ZN7leveldb5Block4Iter4SeekERKNS_5SliceE 3
eleveldb.so`_ZN7leveldb12_GLOBAL__N_115ShardedLRUCache6LookupERKNS_5SliceE 5
beam.smp`process_main 10
^C[root@prod-2024 ~]# dtrace -n 'profile-997 /execname == "beam.smp" && arg1 && tid == 585/ { @[ufunc(arg1)] = count(); } tick-4s {printa(@); exit(0);}'
dtrace: description 'profile-997 ' matched 2 probes
CPU ID FUNCTION:NAME
5 69403 :tick-4s
beam.smp`mbc_free 1
beam.smp`tree_delete 1
beam.smp`bf_get_free_block 1
beam.smp`cmp 1
beam.smp`erts_send_message 1
beam.smp`schedule 1
beam.smp`erts_gc_mixed_times 1
beam.smp`decoded_size 1
beam.smp`encode_size_struct2 1
beam.smp`dec_term 1
beam.smp`build_term_list 1
beam.smp`erts_find_export_entry 1
libstdc++.so.6.0.14`_Znam 1
eleveldb.so`_ZN7leveldbL22GetLengthPrefixedSliceEPKc 1
eleveldb.so`_ZNK7leveldb8SkipListIPKcNS_8MemTable13KeyComparatorEE18FindGreaterOrEqualERKS2_PPNS5_4NodeE 1
eleveldb.so`_ZN7leveldbL10UnrefEntryEPvS0_ 1
eleveldb.so`_ZNK7leveldb5Table11NewIteratorERKNS_11ReadOptionsE 1
eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIteratorD1Ev 1
eleveldb.so`_ZN7leveldb4HashEPKcmj 1
eleveldb.so`_ZN7leveldb5BlockC1EPKcm 1
libumem.so.1`malloc 1
libumem.so.1`highbit 1
libumem.so.1`umem_alloc 1
libumem.so.1`umem_free 1
libc.so.1`offtime_u 1
libc.so.1`pread 1
libc.so.1`mutex_lock 1
libc.so.1`lmutex_lock 1
libc.so.1`pthread_self 1
beam.smp`bf_link_free_block 2
beam.smp`tag_val_def 2
eleveldb.so`_ZN7leveldb12_GLOBAL__N_115ShardedLRUCache7ReleaseEPNS_5Cache6HandleE 2
eleveldb.so`_ZN7leveldb5Block11NewIteratorEPKNS_10ComparatorE 2
eleveldb.so`_ZN7leveldb5Block4IterD0Ev 2
libc.so.1`clear_lockbyte 2
eleveldb.so`_ZN7leveldb5Block4Iter4SeekERKNS_5SliceE 3
libc.so.1`memcmp 3
libc.so.1`memcpy 3
libc.so.1`mutex_lock_impl 4
beam.smp`process_main 6
eleveldb.so`_ZN7leveldb8FindFileERKNS_21InternalKeyComparatorERKSt6vectorIPNS_12FileMetaDataESaIS5_EERKNS_5SliceE 7
WEIRD WEIRD WEIRD WEIRD WEIRD WEIRD WEIRD WEIRD
So, how does this work?
Steps taken:
* riak_core_node_watcher:service_up(riak_kv, whereis(riak_kv_sup)).
* Use etop to watch until mailbox backlog for pid 0.661.0 grows to
about 500.
* riak_core_node_watcher:service_down(riak_kv).
* Move mouse pointer over to "iostat -x 1" window then hit ENTER a
few times to help mark the point in time that I'd run service_down().
* Move mouse over to "dtrace" window and run the script below.
* Watch the dtrace window for output @ END.
* Move mouse over to "iostat" window and press control-c to stop it.
What do I see in the "iostat" window? Zero read IOPS for the entire
10 second period that the dtrace script is running. Yet dtrace is
claiming that the process has made several hundred pread(2) calls that
have taken 16 milliseconds or longer. WEIRD.
dtrace -n 'BEGIN {start = timestamp; } pid$target::eleveldb_get:entry {self->active = 1; total++;} pid$target::eleveldb_get:return {self->active = 0;} syscall:::entry /pid == $target && self->active == 1/ { self->start = timestamp } syscall:::return /pid == $target && self->active == 1 && self->start > 0/ { @[probefunc] = quantize((timestamp - self->start) / 1000000) ; @l[probefunc] = sum(timestamp-self->start); self->start = 0; } tick-5sec { printa(@); printf("Total calls to func: %d in %d msec wall-clock time\n", total, (timestamp - start) / 1000000); normalize(@l, 1000000); printf("Report: total milliseconds per system call (remember, in parallel!):\n"); printa(@l); exit(0);}' -p 7610
dtrace: description 'BEGIN ' matched 470 probes
CPU ID FUNCTION:NAME
9 69404 :tick-10sec
close
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 9
1 | 0
open
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
1 | 0
pread
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@ 171
1 | 0
2 | 0
4 | 0
8 | 0
16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 367
32 |@ 15
64 | 0
Total calls to func: 604 in 9863 msec wall-clock time
Report: total milliseconds per system call (remember, in parallel!):
close 0
open 0
pread 9611
And here's the output of iostat -x 1 | egrep -v 'sd0|ramdisk',
starting shortly where I'd pressed ENTER to mark the time when I made
the transition for the riak_kv service from up -> down.
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd1 22.0 200.0 2816.6 10669.3 0.0 0.2 0.9 0 12
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd1 11.0 0.0 1282.4 0.0 0.0 0.1 6.7 0 6
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd1 9.0 0.0 1089.1 0.0 0.0 0.1 6.1 0 5
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd1 12.0 4.0 1534.9 16.0 0.0 0.1 5.2 0 8
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd1 6.0 44.0 768.5 5431.2 0.0 0.1 1.5 0 4
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd1 0.0 200.9 0.0 8396.9 0.0 0.1 0.3 0 1
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd1 0.0 140.0 0.0 3399.8 0.0 0.0 0.2 0 0
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd1 0.0 146.9 0.0 5718.7 0.0 0.0 0.3 0 1
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
sd1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
^C
---
Timeline for weird-pread-latency.mov:
0:00 -- Upper left: DTrace window, currently idle
Lower left: Riak shell, currently idle
Upper right: iostat -x 1 | egrep -v 'sd0|ramdisk'
Lower right: etop output
0:03 -- Lower left: Enable the riak_kv service to start workload
0:08 -- Upper left: Get dtrace command ready
0:10 -- Lower right: Just watching the "MsgQ" column for a backlog
of about 600 messages. Once we hit that point, there's
about 10 seconds' worth of work left after the riak_kv service
is disabled (i.e., no new work arriving from outside)
1:14 -- Lower right: etop crashed, so I restarted it
2:00 -- Lower right: Finally above the 600 message mark, time to go to work
2:02 -- Lower left: Disable riak_kv service
2:03 -- Upper right: press ENTER a few times to mark when the riak_kv
service went down
2:05 -- Upper left: Start dtrace
2:17 -- Upper left: DTrace script has finished.
Upper right: iostat reported one read operation for the entire
approx. 10 seconds that the D script ran.
---
ANSWER for the puzzling pread latency problem: leveldb is reading 30MB
per request (an entire leveldb file), and that file is cached, so it
isn't generating any disk I/O. All the system time is the kernel
copying 30MB of data, at 22-24 milliseconds per operation. (Copying
~30MB per call at memory-copy speed works out to tens of milliseconds,
which lines up with the 16-32 msec pread bucket above.)
dtrace -qn 'pid$target::eleveldb_get:entry {self->active = 1; total++;} pid$target::eleveldb_get:return {self->active = 0;} syscall::pread:entry /pid == $target && self->active == 1 && arg2 > 1000000/ { printf("pread %d bytes offset %d fd %d path %s\n", arg2, arg3, arg0, fds[arg0].fi_pathname); }' -p 7610
Bummer: apparently the fds[] array isn't available inside a zone? There is no
output from this script except this:
dtrace: error on enabled probe ID 3 (ID 318: syscall::pread:entry): invalid kernel access in action #4 at DIF offset 64
dtrace: error on enabled probe ID 3 (ID 318: syscall::pread:entry): invalid kernel access in action #4 at DIF offset 64
Fine. So, we'll modify it a bit to provide a timestamp and omit the path.
dtrace -qn 'BEGIN { start = timestamp; } pid$target::eleveldb_get:entry {self->active = 1; total++;} pid$target::eleveldb_get:return {self->active = 0;} syscall::pread:entry /pid == $target && self->active == 1 && arg2 > 1000000/ { printf("time %d pread %d bytes offset %d fd %d\n", (timestamp - start) / 1000000000, arg2, arg3, arg0); }' -p 7610
time 3 pread 30929259 bytes offset 311941 fd 4397
time 3 pread 30929259 bytes offset 311941 fd 4397
time 3 pread 30929259 bytes offset 311941 fd 4397
time 3 pread 30929259 bytes offset 311941 fd 4397
time 3 pread 30929259 bytes offset 311941 fd 4397
time 3 pread 30929259 bytes offset 311941 fd 4397
time 3 pread 30929259 bytes offset 311941 fd 4397
time 3 pread 30929259 bytes offset 311941 fd 4397
[...]
dtrace -qn 'pid$target::_ZN7leveldb5SliceC1EPKcm:entry {this->size = (int) arg2; printf("size %d\n", this->size);}'
dtrace -n 'BEGIN { start = timestamp; } pid$target::_ZN7leveldb5SliceC1EPKcm:entry { self->data = arg1; self->size = (arg2 > 0) ? arg2 : 1; } pid$target::eleveldb_get:entry {self->active = 1; total++;} pid$target::eleveldb_get:return {self->active = 0;} syscall::pread:entry /pid == $target && self->active == 1 && arg2 > 50000/ { printf("time %d pread %d bytes offset %d fd %d self->size %d\n", (timestamp - start) / 1000000000, arg2, arg3, arg0, self->size); tracemem(copyin(self->data, 64), 64); }' -p 14922
11 318 pread:entry time 6 pread 62177 bytes offset 1427179 fd 2163 self->size 53
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef
0: 10 00 00 00 03 0c b7 80 08 12 b1 5b ec 96 9b 2d ...........[...-
10: ce bf 72 b0 dd c0 08 12 98 cc e6 53 39 ac de 73 ..r........S9..s
20: 34 9c cc 26 33 09 cd 7e 63 34 9a cd 27 33 99 a4 4..&3..~c4..'3..
30: da 69 35 00 08 66 20 08 02 00 00 00 00 83 44 04 .i5..f .......D.
(riak@127.0.0.1)13> sext:decode(<<16#10, 16#00, 16#00, 16#00, 16#03, 16#0c, 16#b7, 16#80, 16#08, 16#12, 16#b1, 16#5b, 16#ec, 16#96, 16#9b, 16#2d,16#ce, 16#bf, 16#72, 16#b0, 16#dd, 16#c0, 16#08, 16#12, 16#98, 16#cc, 16#e6, 16#53, 16#39, 16#ac, 16#de, 16#73,16#34, 16#9c, 16#cc, 16#26, 16#33, 16#09, 16#cd, 16#7e, 16#63, 16#34, 16#9a, 16#cd, 16#27, 16#33, 16#99, 16#a4,16#da, 16#69, 16#35, 16#00, 16#08>>).
{o,<<"bodies_raw">>,<<"1323579490109_1454994645">>}
---
Later in the same day, Thu Jan 12, 6:21pm CST
Hrm. Well, what's a person to do? I wanted to do more investigation
with DTrace and eleveldb.so, but I wanted a shared lib that had pid
provider probes for class constructors, e.g. leveldb::Slice class. So
I copied the eleveldb.so shared lib from prod-2000 to prod-2024,
stopped Riak, then restarted to use the new shared lib.
What I forgot was that the restart would start the riak_kv service and
leave it running. Duh! The weird/crazy/but-also-nice thing is that
LevelDB apparently wrote enough data to trigger a merge (?),
and now the problem SST file is gone. *And* the latency problem has
disappeared.
I managed to make a copy of the allegedly-bad SST file; it's in the
tarball bad-latency.840833.sst.tar.gz.
-rw-r--r-- 0 riak riak 31242320 Jan 6 02:07 var/db/riak/leveldb/813531184842397132203223002586212227738165575680/840833.sst
... try to reproduce, starting from empty data dirs, Riak + eleveldb backend:
{ok, C} = riak:local_client().
[C:put(riak_object:new(<<"abcdabcdabcdabcdabcdabcdabcdabcdabcdabcdabcd">>, <<"qwerqwerqwerqwerqwerqwerqwerqwer">>, <<"valvalvalvalvalvalvalvalvalvalvalval">>)) || _ <- lists:seq(1,11*1000)].
... will create 3 .sst files. Then clobber those 3 .sst files with the
bad one from XXXer. ... Bah, that doesn't work.
---
Run from sbb:
#!/bin/sh
for i in {0,1,2,3}{0,1,2,3,4,5,6,7,8,9} 40; do
ssh -n -p 2222 root@prod-20$i.XXXer.com "/usr/bin/time curl http://prod-20$i.XXXer.com:8098/stats 2>&1 | awk '/real/ { print \"prod-20$i\", \$0}' " &
done | sort -nr -k 3
---
Ring partition ownership helper functions:
Fowned_partitions = fun(Owners, Node) -> [P || {P, Owner} <- Owners, Owner =:= Node] end.
Fvnode_pids = fun(Node) -> [Pid || {_,Pid,_,_} <- supervisor:which_children({riak_core_vnode_sup, Node})] end.
Frunning_vnodes = fun(Node) -> Pids = Fvnode_pids(Node), [rpc:call(Node, riak_core_vnode, get_mod_index, [Pid], 30000) || Pid <- Pids] end.
Factive_partitions = fun(Node) -> lists:foldl(fun({_,P}, Ps) -> ordsets:add_element(P, Ps) end, [], Frunning_vnodes(Node)) end.
Fpartitions = fun(Node, Ring) -> Owners = riak_core_ring:all_owners(Ring), Owned = ordsets:from_list(Fowned_partitions(Owners, Node)), Active = ordsets:from_list(Factive_partitions(Node)), Stopped = ordsets:subtract(Owned, Active), Secondary = ordsets:subtract(Active, Owned), Primary = ordsets:subtract(Active, Secondary), {Primary, Secondary, Stopped} end.
{ok, Ring1} = riak_core_ring_manager:get_my_ring().
{ok, RRing1} = riak_core_ring_manager:get_raw_ring().
(media-riak@prod-2024-internal.XXXer.com)33> riak_core_ring:index_owner(Ring1, 1444374665018431399630985401082889078019339583488).
'media-riak@prod-2024-internal.XXXer.com'
(media-riak@prod-2024-internal.XXXer.com)34> riak_core_ring:index_owner(Ring1, 552344857077206473969556670176954617780122943488).
'media-riak@prod-2023-internal.XXXer.com'
{Primary1, Secondary1, Stopped1} = Fpartitions(node(), Ring1).
(media-riak@prod-2024-internal.XXXer.com)57> lists:usort([riak_core_ring:next_owner(Ring1, II) || II <- Primary1 ++ Secondary1]).
[{undefined,undefined,undefined}]
Hmmmm, I don't believe that's good: next_owner() does most of
the heavy lifting for riak_core_vnode:should_handoff().
{ok, RRing1} = riak_core_ring_manager:get_raw_ring().
element(7, RRing1). %% element 7 = #chstate_v2.next list
[]
----------
Interruption: error_logger <0.6.0> on prod-2000 is apparently hung.
(media-riak@prod-2000-internal.XXXer.com)19> io:format("~s\n", [element(2,process_info(pid(0,6,0), backtrace))]).
Program counter: 0x00000000030bb790 (gen:do_call/4 + 576)
CP: 0x0000000000000000 (invalid)
arity = 0
0xfffffd7fe59ec4c0 Return addr 0x00000000030c32d0 (gen_event:rpc/2 + 96)
y(0) #Ref<0.0.76.177386>
y(1) 'media-riak@prod-2000-internal.XXXer.com'
y(2) []
y(3) infinity
y(4) {sync_notify,{log,3,{["2012",45,["0",49],45,"14"],["16",58,"24",58,["0",50],46,"534"]},[["[","error","] "],["<0.22085.415>",32],"webmachine error: path=\"/riak/bodies_raw/1326326952460_e24032fb530dcd8a_ba15cfa417c2a5a2\"\n{error,{exit,normal,[{webmachine_request,read_chunk_length,3},{webmachine_request,recv_chunked_body,3},{webmachine_request,read_whole_stream,5},{webmachine_request,call,2},{wrq,req_body,1},{riak_kv_wm_object,accept_doc_body,2},{webmachine_resource,resource_call,3},{webmachine_resource,do,3}]}}"]}}
y(5) <0.6.0>
y(6) <0.41.0>
0xfffffd7fe59ec500 Return addr 0x0000000004777e18 (error_logger_lager_h:handle_event/2 + 6072)
0xfffffd7fe59ec508 Return addr 0x00000000030c73c0 (gen_event:server_update/4 + 272)
y(0) []
y(1) []
y(2) []
y(3) []
y(4) []
y(5) []
y(6) {}
0xfffffd7fe59ec548 Return addr 0x00000000030c6fa8 (gen_event:server_notify/4 + 136)
y(0) error_logger
y(1) {error,<0.89.0>,{<0.22085.415>,"webmachine error: path=~p~n~p~n",["/riak/bodies_raw/1326326952460_e24032fb530dcd8a_ba15cfa417c2a5a2",{error,{exit,normal,[{webmachine_request,read_chunk_length,3},{webmachine_request,recv_chunked_body,3},{webmachine_request,read_whole_stream,5},{webmachine_request,call,2},{wrq,req_body,1},{riak_kv_wm_object,accept_doc_body,2},{webmachine_resource,resource_call,3},{webmachine_resource,do,3}]}}]}}
y(2) {handler,error_logger_lager_h,false,{},<0.47.0>}
y(3) {}
y(4) error_logger_lager_h
y(5) Catch 0x00000000030c73c0 (gen_event:server_update/4 + 272)
0xfffffd7fe59ec580 Return addr 0x00000000030c3ea0 (gen_event:handle_msg/5 + 256)
y(0) error_logger
y(1) handle_event
y(2) {error,<0.89.0>,{<0.22085.415>,"webmachine error: path=~p~n~p~n",["/riak/bodies_raw/1326326952460_e24032fb530dcd8a_ba15cfa417c2a5a2",{error,{exit,normal,[{webmachine_request,read_chunk_length,3},{webmachine_request,recv_chunked_body,3},{webmachine_request,read_whole_stream,5},{webmachine_request,call,2},{wrq,req_body,1},{riak_kv_wm_object,accept_doc_body,2},{webmachine_resource,resource_call,3},{webmachine_resource,do,3}]}}]}}
y(3) []
0xfffffd7fe59ec5a8 Return addr 0x00000000030de690 (proc_lib:init_p_do_apply/3 + 56)
y(0) []
y(1) error_logger
y(2) <0.2.0>
0xfffffd7fe59ec5c8 Return addr 0x00000000006df368 (<terminate process normally>)
y(0) Catch 0x00000000030de6b0 (proc_lib:init_p_do_apply/3 + 88)
ok
(media-riak@prod-2000-internal.XXXer.com)20> io:format("~s\n", [element(2,process_info(pid(0,41,0), backtrace))]).
Program counter: 0x0000000004782d88 (io:wait_io_mon_reply/2 + 56)
CP: 0x0000000000000000 (invalid)
arity = 0
0xfffffd7ff73b8aa0 Return addr 0x00000000047806d8 (io:o_request/3 + 96)
y(0) #Ref<0.0.76.177389>
y(1) <0.37.0>
0xfffffd7ff73b8ab8 Return addr 0x00000000046fe798 (lager_console_backend:handle_event/2 + 752)
y(0) []
y(1) []
y(2) put_chars
y(3) {put_chars,unicode,[["16",58,"24",58,["0",50],46,"534"]," ",["[","error","] "],"webmachine error: path=\"/riak/bodies_raw/1326326952460_e24032fb530dcd8a_ba15cfa417c2a5a2\"\n{error,{exit,normal,[{webmachine_request,read_chunk_length,3},{webmachine_request,recv_chunked_body,3},{webmachine_request,read_whole_stream,5},{webmachine_request,call,2},{wrq,req_body,1},{riak_kv_wm_object,accept_doc_body,2},{webmachine_resource,resource_call,3},{webmachine_resource,do,3}]}}","\n"]}
y(4) <0.37.0>
0xfffffd7ff73b8ae8 Return addr 0x00000000030c73c0 (gen_event:server_update/4 + 272)
y(0) {state,6,false}
0xfffffd7ff73b8af8 Return addr 0x00000000030c6fa8 (gen_event:server_notify/4 + 136)
y(0) lager_event
y(1) {log,3,{["2012",45,["0",49],45,"14"],["16",58,"24",58,["0",50],46,"534"]},[["[","error","] "],["<0.22085.415>",32],"webmachine error: path=\"/riak/bodies_raw/1326326952460_e24032fb530dcd8a_ba15cfa417c2a5a2\"\n{error,{exit,normal,[{webmachine_request,read_chunk_length,3},{webmachine_request,recv_chunked_body,3},{webmachine_request,read_whole_stream,5},{webmachine_request,call,2},{wrq,req_body,1},{riak_kv_wm_object,accept_doc_body,2},{webmachine_resource,resource_call,3},{webmachine_resource,do,3}]}}"]}
y(2) {handler,lager_console_backend,false,{state,6,false},<0.44.0>}
y(3) {state,6,false}
y(4) lager_console_backend
y(5) Catch 0x00000000030c73c0 (gen_event:server_update/4 + 272)
0xfffffd7ff73b8b30 Return addr 0x00000000030c7088 (gen_event:server_notify/4 + 360)
y(0) lager_event
y(1) handle_event
y(2) {log,3,{["2012",45,["0",49],45,"14"],["16",58,"24",58,["0",50],46,"534"]},[["[","error","] "],["<0.22085.415>",32],"webmachine error: path=\"/riak/bodies_raw/1326326952460_e24032fb530dcd8a_ba15cfa417c2a5a2\"\n{error,{exit,normal,[{webmachine_request,read_chunk_length,3},{webmachine_request,recv_chunked_body,3},{webmachine_request,read_whole_stream,5},{webmachine_request,call,2},{wrq,req_body,1},{riak_kv_wm_object,accept_doc_body,2},{webmachine_resource,resource_call,3},{webmachine_resource,do,3}]}}"]}
y(3) []
0xfffffd7ff73b8b58 Return addr 0x00000000030c4008 (gen_event:handle_msg/5 + 616)
y(0) {handler,lager_file_backend,false,{state,[{file,"/var/log/riak/error.log",3,{file_descriptor,prim_file,{#Port<0.1773>,11}},880192,false,10485760,[{hour,0}],5},{file,"/var/log/riak/console.log",6,{file_descriptor,prim_file,{#Port<0.1774>,12}},880193,false,10485760,[{hour,0}],5}]},<0.45.0>}
0xfffffd7ff73b8b68 Return addr 0x00000000030de690 (proc_lib:init_p_do_apply/3 + 56)
y(0) []
y(1) lager_event
y(2) <0.40.0>
y(3) {<0.6.0>,#Ref<0.0.76.177386>}
y(4) <0.6.0>
0xfffffd7ff73b8b98 Return addr 0x00000000006df368 (<terminate process normally>)
y(0) Catch 0x00000000030de6b0 (proc_lib:init_p_do_apply/3 + 88)
ok
NOTE: We've been waiting on the same ref, #Ref<0.0.76.177389>, for
many minutes now. :-(
This may be a DTrace-related I/O problem (though no Erlang VM probes
are enabled that I know of). Or not. I can't tell.
UPDATE: See below for more info.
---
OK, on to trying to see how many LevelDB calls might be executing
simultaneously.
dtrace -qn 'BEGIN { highest = 0; cnt = 0; } pid$target:eleveldb.so:eleveldb_*:entry { cnt++ ; highest = (cnt > highest) ? cnt : highest; self->start = timestamp; } pid$target:eleveldb.so:eleveldb_*:return /self->start > 0/ { cnt = (cnt < 1) ? 0 : (cnt -1); @ = quantize((timestamp - self->start) / 1000000); self->start = 0;} tick-5sec { printf("Highest is %d\n", highest); highest = 0;} tick-30sec { cnt = 0; /*printf("Breakdown in milliseconds\n"); printa(@);*/ trunc(@);} ' -p 14922
---
"Bad SST" ... may not be bad, but stupid app asking for the same
big-ass key, {<<"bodies_raw">>, <<"key">>}, zillions of times.
{_, M} = process_info(pid(0,544,0), messages).
Ks = lists:sort([element(2, element(4, element(2, GE))) || GE <- M, element(1, element(4, element(2, GE))) == riak_kv_get_req_v1]).
Weird = [BK || {<<"bodies_raw">>, <<"key">>} = BK <- Ks].
(media-riak@prod-2023-internal.XXXer.com)27> length(Weird).
30590
And the value stored under that key is 50MB.
(media-riak@prod-2024-internal.XXXer.com)9> file:write_file("/tmp/bad-sst-but-really-big-bkey-get.bin", term_to_binary(M)).
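A quick way to double-check that size straight from the backend (a sketch only:
it assumes the sext-encoded {o, Bucket, Key} layout seen in the hex dump above,
should be run against whichever vnode's leveldb directory holds a replica of
the key -- the path here is just borrowed from the session below -- and the
eleveldb:get/3 usage is from memory):
{ok, DbRef} = eleveldb:open("/var/db/riak/leveldb/812103937149691172322164716616762732601782829056", []),
BKey = sext:encode({o, <<"bodies_raw">>, <<"key">>}),
case eleveldb:get(DbRef, BKey, []) of
    {ok, Val} -> io:format("stored value is ~p bytes\n", [byte_size(Val)]);  %% expect ~50MB
    not_found -> io:format("bkey not in this vnode's leveldb\n", [])
end.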
----
(media-riak@prod-2023-internal.XXXer.com)1> {ok, Ref} = eleveldb:open("/var/db/riak/leveldb/812103937149691172322164716616762732601782829056", []).
{ok,<<>>}
(media-riak@prod-2023-internal.XXXer.com)3> X = eleveldb:fold_keys(Ref, fun(X, Acc) -> [X|Acc] end, [], []).
(media-riak@prod-2023-internal.XXXer.com)4> file:write_file("/var/tmp/keys-just-in-case.bin", term_to_binary(X)).
ok
(media-riak@prod-2023-internal.XXXer.com)5> length(X).
3864400
(media-riak@prod-2023-internal.XXXer.com)6> sext:decode(hd(X)).
{o,<<"messages">>,
<<"zyana.casillas.fb.100002111391145:contacts">>}
fun() ->
%% Dump every key in this vnode's leveldb to a length-prefixed binary file,
%% so we have a copy to poke at offline.
{ok, Ref} = eleveldb:open("/var/db/riak/leveldb/812103937149691172322164716616762732601782829056", []),
{ok, FH} = file:open("/var/tmp/812103937149691172322164716616762732601782829056.slfhackdump", [write, binary, raw, compressed]),
WriteOne = fun(X, {FH2, Count}) ->
XBin = term_to_binary(X),
XBinSize = size(XBin),
%% Record format: 4-byte big-endian size, then the term_to_binary blob.
file:write(FH2, [<<XBinSize:32>>, XBin]),
%% Test cap: bail out after 100 records ("TEST IT!"); remove for a full dump.
if Count /= 100 -> ok; true -> error_logger:info_msg("Hack at ~p, TEST IT!\n", [Count]), file:close(FH2), exit(normal) end,
if Count rem 10000 == 0 -> error_logger:info_msg("Hack dump wrote ~p records\n", [Count]); true -> ok end,
{FH2, Count+1}
end,
{_, Final} = eleveldb:fold_keys(Ref, WriteOne, {FH, 1}, []),
error_logger:info_msg("Hack dump final: wrote ~p records\n", [Final]),
exit(normal)
end.
---
dtrace -qn 'BEGIN { highest = 0; cnt = 0; } pid$target:eleveldb.so:eleveldb_*:entry,pid$target:eleveldb.so:*DoCompactionWork*:entry { cnt++ ; highest = (cnt > highest) ? cnt : highest; } pid$target:eleveldb.so:eleveldb_*:return,pid$target:eleveldb.so:*DoCompactionWork*:return { cnt = (cnt < 1) ? 0 : (cnt -1); } tick-5sec { printf("Highest is %d\n", highest); highest = 0;} tick-30sec { printf("reset\n"); cnt = 0;} ' -p 23271
Highest is 9
Highest is 8
Highest is 7
Highest is 10
Highest is 10
[...]
dtrace -qn 'pid$target:eleveldb.so:*DoCompactionWork*:entry {self->start = timestamp;} pid$target:eleveldb.so:*DoCompactionWork*:return { printf("compaction %d msec\n", (timestamp - self->start) / 1000000);}' -p 23271
... Hm hrm hmmmm very interesting, see 2011-12-07.XXXer.data/slf.sh and
slf.d implementation.
prod-2013 compaction 3445242949531516 -> 3445579140520676 = 336190 msec
prod-2016 compaction 3444935493726522 -> 3445264709366428 = 329215 msec
prod-2017 compaction 3445574248440866 -> 3445866808284379 = 292559 msec
prod-2016 compaction 3446367501174857 -> 3446645210490365 = 277709 msec
prod-2004 compaction 3446351606124390 -> 3446569990225495 = 218384 msec
prod-2009 compaction 3447014449109281 -> 3447087977597887 = 73528 msec
prod-2010 compaction 3446561322573171 -> 3446630495727881 = 69173 msec
prod-2009 compaction 3444289135231845 -> 3444357866368973 = 68731 msec
prod-2024 compaction 3218058352359332 -> 3218126783109779 = 68430 msec
prod-2002 compaction 3445177206907731 -> 3445244794304685 = 67587 msec
prod-2002 compaction 3446004235060663 -> 3446065053403479 = 60818 msec
prod-2011 compaction 3446845235168486 -> 3446904391635234 = 59156 msec
prod-2016 compaction 3444818206254735 -> 3444877176239571 = 58969 msec
prod-2006 compaction 3445691300975842 -> 3445747795173234 = 56494 msec
prod-2006 compaction 3447533560070312 -> 3447588734869068 = 55174 msec
prod-2000 compaction 3447230625813014 -> 3447284372201923 = 53746 msec
prod-2013 compaction 3446347422456985 -> 3446398488149603 = 51065 msec
prod-2002 compaction 3446112111566514 -> 3446160085767080 = 47974 msec
prod-2000 compaction 3446089380507528 -> 3446129998030687 = 40617 msec
prod-2009 compaction 3443726676664708 -> 3443766386818881 = 39710 msec
prod-2017 compaction 3443354706109044 -> 3443391235674135 = 36529 msec
prod-2006 compaction 3445627567244311 -> 3445658942845795 = 31375 msec
prod-2009 compaction 3446209955030151 -> 3446239145536200 = 29190 msec
prod-2020 compaction 3218239502629725 -> 3218266849931259 = 27347 msec
---
Notes on the plane (see notebook for originals)
* Enumerate _observed_ problems with the Joyent cluster:
1. ** Node 'media-riak@prod-20XX-internal.XXXer.com' not responding **
NOTE: This happened 20 times to 16 unique machines as observed
by prod-2039 on 15-Jan-2012
2. Slow HTTP queries to /stats and (!) /ping URIs
NOTE: This happens even when the curl client is on localhost, so
network effects (e.g., packet loss) are not a factor
3. eleveldb interference during the monster merges at startup, e.g. in the
wee hours of Sunday 15-Jan-2012.
* Investigate everything an HTTP /stats query does: how many Erlang
processes does the service communicate with? Though we also see
latency problems with the /ping URI, I've seen curl requests to
http://localhost:8098/stats take over 60 seconds.
* We don't have any meaningful network stats history. XXXer hasn't
set up Cacti or equivalent yet. So I believe we need to monitor
this ourselves on each node and keep a history, polling data from
"netstat -s" on a very frequent basis, e.g. every 5 seconds.
* Would it be helpful to have a D script to watch the internal VM
functions for sending & receiving distributed Erlang messages? This
could also confirm what the cluster-wide "not responding" errors claim:
that the complained-about node takes a nap/goes on holiday for at
least 60 seconds and sometimes for several minutes. (A rough
Erlang-side logger sketch appears just after this list.)
- We still don't know if this problem is related to network problems
(e.g., packet loss), NIF interference with Erlang scheduler threads,
both, or neither.
* Has Matt & crew found the app code path that's generating the
bizillion requests for the bogus 30-50MB {<<"bodies_raw">>,
<<"key">>} bkey?
- If not, then how about patching the get & put FSMs to look for that
!@#$! bkey and complain whenever they see it? Or ignore any op
with that key? Or return an HTTP 972 CRAZY REPLY to provoke their
app into crashing ASAP?
* OT: create D to decode Erlang empty list, atom
* OT: resolve R15B source merge conflict
* OT: pester CDL about network monitoring best practices
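Re: the nodedown-confirmation idea above, here's a rough Erlang-side sketch
(not the proposed D script): it logs every nodeup/nodedown with its reason and
a local timestamp, so we can measure exactly how long a "not responding" node
stays gone. The consoles already show net_kernel events; this is just a
self-contained way to capture them on any node.
NodeEventLogger = fun() ->
    ok = net_kernel:monitor_nodes(true, [{node_type, visible}, nodedown_reason]),
    Loop = fun(F) ->
               receive
                   {nodeup, Node, Info} ->
                       error_logger:info_msg("nodeup ~p ~p ~p\n",
                                             [Node, Info, calendar:local_time()]);
                   {nodedown, Node, Info} ->
                       error_logger:info_msg("nodedown ~p ~p ~p\n",
                                             [Node, Info, calendar:local_time()])
               end,
               F(F)
           end,
    Loop(Loop)
end.
spawn(NodeEventLogger).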
----
Output from slf.d:
January 17, 2012 05:38:18 AM UTC
prod-2017 tid 593 compaction 3453448176397581 -> 3453774276161487 = 326099 msec
Report from prod-2000 (and all others, though not included here):
2012-01-17 05:36:56.216 ...** Node 'media-riak@prod-2017-internal.XXXer.com' not responding **
2012-01-17 05:36:56.492 ...net_kernel event: {nodedown,'media-riak@prod-2017-internal.XXXer.com',[{nodedown_reason,net_tick_timeout},{node_type,visible}]}
2012-01-17 05:38:58.641 ...net_kernel event: {nodeup,'media-riak@prod-2017-internal.XXXer.com',[{node_type,visible}]}
January 17, 2012 05:30:18 AM UTC
prod-2014 tid 594 compaction 3453278623074909 -> 3453568805849073 = 290182 msec
Report from prod-2000 (and all others, though not included here):
2012-01-17 05:29:24.449 ...** Node 'media-riak@prod-2014-internal.XXXer.com' not responding **
2012-01-17 05:29:24.678 ...net_kernel event: {nodedown,'media-riak@prod-2014-internal.XXXer.com',[{nodedown_reason,net_tick_timeout},{node_type,visible}]}
2012-01-17 05:30:29.045 ...net_kernel event: {nodeup,'media-riak@prod-2014-internal.XXXer.com',[{node_type,visible}]}
January 17, 2012 05:40:18 AM UTC
prod-2012 tid 593 compaction 3454206245670796 -> 3454284912945709 = 78667 msec
No nodedown events reported
January 17, 2012 05:00:18 AM UTC
prod-2010 tid 593 compaction 3451910297021077 -> 3451988140075538 = 77843 msec
No nodedown events reported
---
2012-01-18 Wednesday
On the LevelDB front, we appear to have found two different bugs:
* A single compaction op can "Expand" to add thousands of extra files
* While trying to recreate the above bug, found another apparent bug
where a single eleveldb put triggers a cascade of 3-file (2-file?)
compactions, dozens or hundreds of them, that blocks the put for 30+
seconds. Each little compaction appears to involve the same number of
files.
----
More on the error-logger-is-blocked front.
My reading of this is that the lager_event process, <0.43.0>, is
blocked waiting for a reply from its group leader, <0.39.0>. The
group leader's mailbox is empty, so what happened? Did the
{io_request, ...} message never arrive at the group leader, or did
the group leader's reply never arrive back? In either case, how
could such intra-machine messages ever get lost??
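As a shortcut for spotting this state without dumping backtraces by hand,
something like this (just a sketch) lists every local process currently parked
in io:wait_io_mon_reply/2, plus its registered name and group leader:
StuckOnIo = fun() ->
    [{Pid,
      proplists:get_value(registered_name, Info),
      proplists:get_value(group_leader, Info)}
     || Pid <- processes(),
        Info <- [process_info(Pid, [current_function, registered_name,
                                    group_leader])],
        Info =/= undefined,
        proplists:get_value(current_function, Info) =:=
            {io, wait_io_mon_reply, 2}]
end.
StuckOnIo().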
(media-riak@prod-2000-internal.XXXer.com)5> lists:sort([{Size, Pid}
(media-riak@prod-2000-internal.XXXer.com)5> || Pid <- processes(),
(media-riak@prod-2000-internal.XXXer.com)5> {message_queue_len, Size} <-
(media-riak@prod-2000-internal.XXXer.com)5> [catch process_info(Pid, message_queue_len)],
(media-riak@prod-2000-internal.XXXer.com)5> Size > 0]).
[{1,<0.32.0>},{16,<0.43.0>},{261,<0.89.0>},{2856,<0.5.0>}]
(media-riak@prod-2000-internal.XXXer.com)6>
(media-riak@prod-2000-internal.XXXer.com)6>
(media-riak@prod-2000-internal.XXXer.com)6>
(media-riak@prod-2000-internal.XXXer.com)6> io:format("~s\n", [element(2,process_info(pid(0,5,0), backtrace))]).
Program counter: 0x00000000030aa5a0 (gen:do_call/4 + 576)
CP: 0x0000000000000000 (invalid)
arity = 0
0xfffffd7fd0445600 Return addr 0x00000000030b6a18 (gen_event:rpc/2 + 96)
y(0) #Ref<0.0.22.18473>
y(1) 'media-riak@prod-2000-internal.XXXer.com'
y(2) []
y(3) infinity
y(4) {sync_notify,{log,3,{["2012",45,["0",49],45,"19"],[["0",49],58,"31",58,"32",46,["0",50,50]]},[["[","error","] "],["<0.19513.124>",32],"webmachine error: path=\"/riak/bodies_raw/1326936634921_0291564814_f691cea\"\n{error,{exit,normal,[{webmachine_request,read_chunk_length,3},{webmachine_request,recv_chunked_body,3},{webmachine_request,read_whole_stream,5},{webmachine_request,call,2},{wrq,req_body,1},{riak_kv_wm_object,accept_doc_body,2},{webmachine_resource,resource_call,3},{webmachine_resource,do,3}]}}"]}}
y(5) <0.5.0>
y(6) <0.43.0>
0xfffffd7fd0445640 Return addr 0x0000000004781518 (error_logger_lager_h:handle_event/2 + 6072)
0xfffffd7fd0445648 Return addr 0x00000000030bab08 (gen_event:server_update/4 + 272)
y(0) []
y(1) []
y(2) []
y(3) []
y(4) []
y(5) []
y(6) {}
0xfffffd7fd0445688 Return addr 0x00000000030ba6f0 (gen_event:server_notify/4 + 136)
y(0) error_logger
y(1) {error,<0.91.0>,{<0.19513.124>,"webmachine error: path=~p~n~p~n",["/riak/bodies_raw/1326936634921_0291564814_f691cea",{error,{exit,normal,[{webmachine_request,read_chunk_length,3},{webmachine_request,recv_chunked_body,3},{webmachine_request,read_whole_stream,5},{webmachine_request,call,2},{wrq,req_body,1},{riak_kv_wm_object,accept_doc_body,2},{webmachine_resource,resource_call,3},{webmachine_resource,do,3}]}}]}}
y(2) {handler,error_logger_lager_h,false,{},<0.49.0>}
y(3) {}
y(4) error_logger_lager_h
y(5) Catch 0x00000000030bab08 (gen_event:server_update/4 + 272)
0xfffffd7fd04456c0 Return addr 0x00000000030b75e8 (gen_event:handle_msg/5 + 256)
y(0) error_logger
y(1) handle_event
y(2) {error,<0.91.0>,{<0.19513.124>,"webmachine error: path=~p~n~p~n",["/riak/bodies_raw/1326936634921_0291564814_f691cea",{error,{exit,normal,[{webmachine_request,read_chunk_length,3},{webmachine_request,recv_chunked_body,3},{webmachine_request,read_whole_stream,5},{webmachine_request,call,2},{wrq,req_body,1},{riak_kv_wm_object,accept_doc_body,2},{webmachine_resource,resource_call,3},{webmachine_resource,do,3}]}}]}}
y(3) []
0xfffffd7fd04456e8 Return addr 0x00000000030afec8 (proc_lib:init_p_do_apply/3 + 56)
y(0) []
y(1) error_logger
y(2) <0.2.0>
0xfffffd7fd0445708 Return addr 0x00000000006df368 (<terminate process normally>)
y(0) Catch 0x00000000030afee8 (proc_lib:init_p_do_apply/3 + 88)
ok
(media-riak@prod-2000-internal.XXXer.com)7> process_info(pid(054,0)).
** exception error: undefined shell command pid/2
(media-riak@prod-2000-internal.XXXer.com)8> process_info(pid(0,5,0)).
[{registered_name,error_logger},
{current_function,{gen,do_call,4}},
{initial_call,{proc_lib,init_p,5}},
{status,waiting},
{message_queue_len,2858},
{messages,[{notify,{error,<0.91.0>,
{<0.30723.124>,"webmachine error: path=~p~n~p~n",
["/riak/bodies_raw/1326936665946_0712657087_890c94f",
{error,{exit,normal,
[{webmachine_request,read_chunk_length,3},
{webmachine_request,recv_chunked_body,3},
{webmachine_request,read_whole_stream,...},
{webmachine_request,...},
{...}|...]}}]}}},
{notify,{error,<0.91.0>,
{<0.31283.124>,"webmachine error: path=~p~n~p~n",
["/riak/bodies_raw/1326927019020_4817631435_19183809",
{error,{exit,normal,
[{webmachine_request,read_chunk_length,3},
{webmachine_request,recv_chunked_body,...},
{webmachine_request,...},
{...}|...]}}]}}},
{notify,{error,<0.91.0>,
{<0.27205.124>,"webmachine error: path=~p~n~p~n",
["/riak/bodies_raw/1326936639145_0eefb4d7ced50272_d431aae4335c28d6",
{error,{exit,normal,
[{webmachine_request,read_chunk_length,...},
{webmachine_request,...},
{...}|...]}}]}}},
{notify,{error,<0.91.0>,
{<0.19987.125>,"webmachine error: path=~p~n~p~n",
["/riak/bodies_raw/1326936749141_2996738267",
{error,{exit,normal,
[{webmachine_request,...},{...}|...]}}]}}},
{notify,{error,<0.91.0>,
{<0.20989.125>,"webmachine error: path=~p~n~p~n",
["/riak/bodies_raw/1326936319977_0755845112_1753401",
{error,{exit,normal,[{...}|...]}}]}}},
{notify,{error,<0.91.0>,
{<0.18515.121>,"webmachine error: path=~p~n~p~n",
["/riak/bodies_raw/1326936525750_0871077864_f2af806",
{error,{exit,normal,[...]}}]}}},
{notify,{error,<0.91.0>,
{<0.24401.125>,"webmachine error: path=~p~n~p~n",
["/riak/bodies_raw/1326936763467_1191873287",
{error,{exit,normal,...}}]}}},
{notify,{error,<0.91.0>,
{<0.7968.126>,"webmachine error: path=~p~n~p~n",
["/riak/bodies_raw/lrg_1326930983572_279395581_7b1d173f",
{error,{error,...}}]}}},
{notify,{error,<0.91.0>,
{<0.7101.126>,"webmachine error: path=~p~n~p~n",
["/riak/bodies_raw/1326936457880_bab39ccd82aa0131_da56ea4b2b8643b3",
{error,{...}}]}}},
{notify,{error,<0.91.0>,
{<0.21435.126>,"webmachine error: path=~p~n~p~n",
["/riak/bodies_raw/1326936030608_c540d7f481f91100_ccd676e4dabd9b62",
{error,...}]}}},
{notify,{error,<0.91.0>,
{<0.1781.127>,"webmachine error: path=~p~n~p~n",
["/riak/bodies_raw/1326936083601_0249619846_0a82470",
{...}]}}},
{notify,{error,<0.91.0>,
{<0.20431.127>,"webmachine error: path=~p~n~p~n",
[[...]|...]}}},
{notify,{error,<0.91.0>,
{<0.28118.127>,"webmachine error: path=~p~n~p~n",[...]}}},
{notify,{error,<0.91.0>,{<0.4580.128>,[...],...}}},
{notify,{error,<0.91.0>,{<0.11898.128>,...}}},
{notify,{error,<0.91.0>,{...}}},
{notify,{error,<0.91.0>,...}},
{notify,{error,...}},
{notify,{...}},
{notify,...},
{...}|...]},
{links,[<0.32.0>,<0.49.0>,<0.0.0>]},
{dictionary,[{'$ancestors',[<0.2.0>]},
{'$initial_call',{gen_event,init_it,6}}]},
{trap_exit,true},
{error_handler,error_handler},
{priority,normal},
{group_leader,<0.32.0>},
{total_heap_size,1346269},
{heap_size,1346269},
{stack_size,35},
{reductions,6818201},
{garbage_collection,[{min_bin_vheap_size,46368},
{min_heap_size,233},
{fullsweep_after,0},
{minor_gcs,0}]},
{suspending,[]}]
(media-riak@prod-2000-internal.XXXer.com)9> process_info(pid(0,43,0)).
[{registered_name,lager_event},
{current_function,{io,wait_io_mon_reply,2}},
{initial_call,{proc_lib,init_p,5}},
{status,waiting},
{message_queue_len,16},
{messages,[{<0.11163.124>,
{<0.11163.124>,#Ref<0.0.22.3946>},
{sync_notify,{log,6,
{["2012",45,["0",49],45,"19"],
[["0",49],58,"31",58,"17",46,"799"]},
[["[","info","] "],
["<0.11163.124>",64,"riak_core_handoff_sender",58,
"start_fold",58,"39",32],
"Starting handoff of partition riak_kv_vnode 1084708246456529509604297336781616303650887434240 from 'media-riak@prod-2000-internal.XXXer.com' to 'media-riak@prod-2006-internal.XXXer.com'"]}}},
{<0.11237.124>,
{<0.11237.124>,#Ref<0.0.22.5130>},
{sync_notify,{log,6,
{["2012",45,["0",49],45,"19"],
[["0",49],58,"31",58,"18",46,["0",50|...]]},
[["[","info","] "],
["<0.11237.124>",64,"riak_core_handoff_receiver",58,
"process_message",58,"101",32],
"Receiving handoff data for partition riak_kv_vnode:1113253200310648707225463056170606206378542366720"]}}},
{<0.11241.124>,
{<0.11241.124>,#Ref<0.0.22.5131>},
{sync_notify,{log,6,
{["2012",45,["0",49],45,"19"],
[["0",49],58,"31",58,"18",46,[[...]|...]]},
[["[","info","] "],
["<0.11241.124>",64,"riak_core_handoff_receiver",58,
"process_message",58,
[...]|...],
"Receiving handoff data for partition riak_kv_vnode:1076144760300293750317947620964919332832590954496"]}}},
{<0.11289.124>,
{<0.11289.124>,#Ref<0.0.22.5202>},
{sync_notify,{log,6,
{["2012",45,["0",49],45,"19"],
[["0",49],58,"31",58,"18",46,[...]]},
[["[","info","] "],
["<0.11289.124>",64,"riak_core_handoff_receiver",58,
"process_message",58|...],
"Receiving handoff data for partition riak_kv_vnode:1001927880279583836502916750553545585740688130048"]}}},
{<0.11293.124>,
{<0.11293.124>,#Ref<0.0.22.5204>},
{sync_notify,{log,6,
{["2012",45,["0",49],45,"19"],
[["0",49],58,"31",58,"18",46|...]},
[["[","info","] "],
["<0.11293.124>",64,"riak_core_handoff_receiver",58,
[...]|...],
"Receiving handoff data for partition riak_kv_vnode:667951920186389224335277833702363723827125420032"]}}},
{<0.17488.124>,
{<0.17488.124>,#Ref<0.0.22.13770>},
{sync_notify,{log,6,
{["2012",45,["0",49],45,"19"],
[["0",49],58,"31",58,[...]|...]},
[["[","info","] "],
["<0.17488.124>",64,"riak_core_handoff_receiver",58|...],
"Receiving handoff data for partition riak_kv_vnode:1447229160403843319393101973021788068292105076736"]}}},
{<0.5.0>,
{<0.5.0>,#Ref<0.0.22.18473>},
{sync_notify,{log,3,
{["2012",45,["0",49],45,[...]],[["0",49],58,"31",58|...]},
[["[","error","] "],
["<0.19513.124>",32],
"webmachine error: path=\"/riak/bodies_raw/1326936634921_0291564814_f691cea\"\n{error,{exit,normal,[{webmachine_request,read_chunk_length,3},{webmachine_request,recv_chunked_body,3},{webmachine_request,read_whole_stream,5},{webmachine_request,call,2},{wrq,req_body,1},{riak_kv_wm_object,accept_doc_body,2},{webmachine_resource,resource_call,3},{webmachine_resource,do,3}]}}"]}}},
{<0.24660.124>,
{<0.24660.124>,#Ref<0.0.22.24522>},
{sync_notify,{log,6,
{["2012",45,[[...]|...],45|...],[["0",49],58,[...]|...]},
[["[","info",[...]],
["<0.24660.124>",64|...],
"Receiving handoff data for partition riak_kv_vnode:593735040165679310520246963290989976735222595584"]}}},
{<0.30213.125>,
{<0.30213.125>,#Ref<0.0.22.84272>},
{sync_notify,{log,6,
{["2012",45,[...]|...],[[[...]|...],58|...]},
[["[",[...]|...],[[...]|...],[...]]}}},
{<0.2254.127>,
{<0.2254.127>,#Ref<0.0.22.142846>},
{sync_notify,{log,6,
{["2012",45|...],[[...]|...]},
[[[...]|...],[...]|...]}}},
{<0.89.0>,
{<0.89.0>,#Ref<0.0.24.65121>},
{sync_notify,{log,6,{[[...]|...],[...]},[[...]|...]}}},
{<0.29908.243>,
{<0.29908.243>,#Ref<0.0.44.234047>},
{sync_notify,{log,6,{[...],...},[...]}}},
{<0.10551.244>,
{<0.10551.244>,#Ref<0.0.44.253694>},
{sync_notify,{log,6,{...},...}}},
{<0.1660.325>,
{<0.1660.325>,#Ref<0.0.60.172925>},
{sync_notify,{log,6,...}}},
{<0.11720.325>,
{<0.11720.325>,#Ref<0.0.60.189769>},
{sync_notify,{log,...}}},
{<0.8324.340>,
{<0.8324.340>,#Ref<0.0.63.180000>},
{sync_notify,{...}}}]},
{links,[<0.46.0>,<0.47.0>,#Port<0.1755082>,<0.42.0>,
#Port<0.1755079>]},
{dictionary,[{'$ancestors',[lager_sup,<0.40.0>]},
{'$initial_call',{gen_event,init_it,6}}]},
{trap_exit,true},
{error_handler,error_handler},
{priority,normal},
{group_leader,<0.39.0>},
{total_heap_size,10946},
{heap_size,10946},
{stack_size,33},
{reductions,975453},
{garbage_collection,[{min_bin_vheap_size,46368},
{min_heap_size,233},
{fullsweep_after,0},
{minor_gcs,0}]},
{suspending,[]}]
(media-riak@prod-2000-internal.XXXer.com)10>
(media-riak@prod-2000-internal.XXXer.com)10>
(media-riak@prod-2000-internal.XXXer.com)10> io:format("~s\n", [element(2,process_info(pid(0,43,0), backtrace))]).
Program counter: 0x0000000004773650 (io:wait_io_mon_reply/2 + 56)
CP: 0x0000000000000000 (invalid)
arity = 0
0xfffffd7ffad53d58 Return addr 0x0000000004770fa0 (io:o_request/3 + 96)
y(0) #Ref<0.0.22.3921>
y(1) <0.39.0>
0xfffffd7ffad53d70 Return addr 0x0000000004700e78 (lager_console_backend:handle_event/2 + 752)
y(0) []
y(1) []
y(2) put_chars
y(3) {put_chars,unicode,[[["0",49],58,"31",58,"17",46,"796"]," ",["[","info","] "],"Starting handoff of partition riak_kv_vnode 1402984481929958563080295107968853719064239931392 from 'media-riak@prod-2000-internal.XXXer.com' to 'media-riak@prod-2021-internal.XXXer.com'","\n"]}
y(4) <0.39.0>
0xfffffd7ffad53da0 Return addr 0x00000000030bab08 (gen_event:server_update/4 + 272)
y(0) {state,6,false}
0xfffffd7ffad53db0 Return addr 0x00000000030ba6f0 (gen_event:server_notify/4 + 136)
y(0) lager_event
y(1) {log,6,{["2012",45,["0",49],45,"19"],[["0",49],58,"31",58,"17",46,"796"]},[["[","info","] "],["<0.11161.124>",64,"riak_core_handoff_sender",58,"start_fold",58,"39",32],"Starting handoff of partition riak_kv_vnode 1402984481929958563080295107968853719064239931392 from 'media-riak@prod-2000-internal.XXXer.com' to 'media-riak@prod-2021-internal.XXXer.com'"]}
y(2) {handler,lager_console_backend,false,{state,6,false},<0.46.0>}
y(3) {state,6,false}
y(4) lager_console_backend
y(5) Catch 0x00000000030bab08 (gen_event:server_update/4 + 272)
0xfffffd7ffad53de8 Return addr 0x00000000030ba7d0 (gen_event:server_notify/4 + 360)
y(0) lager_event
y(1) handle_event
y(2) {log,6,{["2012",45,["0",49],45,"19"],[["0",49],58,"31",58,"17",46,"796"]},[["[","info","] "],["<0.11161.124>",64,"riak_core_handoff_sender",58,"start_fold",58,"39",32],"Starting handoff of partition riak_kv_vnode 1402984481929958563080295107968853719064239931392 from 'media-riak@prod-2000-internal.XXXer.com' to 'media-riak@prod-2021-internal.XXXer.com'"]}
y(3) []
0xfffffd7ffad53e10 Return addr 0x00000000030b7750 (gen_event:handle_msg/5 + 616)
y(0) {handler,lager_file_backend,false,{state,[{file,"/var/log/riak/error.log",3,{file_descriptor,prim_file,{#Port<0.1755082>,11}},1727659,false,10485760,[{hour,0}],5},{file,"/var/log/riak/console.log",6,{file_descriptor,prim_file,{#Port<0.1755079>,12}},1727658,false,10485760,[{hour,0}],5}]},<0.47.0>}
0xfffffd7ffad53e20 Return addr 0x00000000030afec8 (proc_lib:init_p_do_apply/3 + 56)
y(0) []
y(1) lager_event
y(2) <0.42.0>
y(3) {<0.11161.124>,#Ref<0.0.22.3919>}
y(4) <0.11161.124>
0xfffffd7ffad53e50 Return addr 0x00000000006df368 (<terminate process normally>)
y(0) Catch 0x00000000030afee8 (proc_lib:init_p_do_apply/3 + 88)
ok
(media-riak@prod-2000-internal.XXXer.com)11>
(media-riak@prod-2000-internal.XXXer.com)11>
(media-riak@prod-2000-internal.XXXer.com)11>
(media-riak@prod-2000-internal.XXXer.com)11> process_info(pid(0,39,0)).
[{current_function,{application_master,main_loop,2}},
{initial_call,{proc_lib,init_p,5}},
{status,waiting},
{message_queue_len,0},
{messages,[]},
{links,[<0.6.0>,<0.40.0>]},
{dictionary,[{'$ancestors',[<0.38.0>]},
{'$initial_call',{application_master,init,4}}]},
{trap_exit,true},
{error_handler,error_handler},
{priority,normal},
{group_leader,<0.39.0>},
{total_heap_size,233},
{heap_size,233},
{stack_size,6},
{reductions,4369},
{garbage_collection,[{min_bin_vheap_size,46368},
{min_heap_size,233},
{fullsweep_after,0},
{minor_gcs,0}]},
{suspending,[]}]
(media-riak@prod-2000-internal.XXXer.com)12> io:format("~s\n", [element(2,process_info(pid(0,39,0), backtrace))]).
Program counter: 0x00000000012f7130 (application_master:main_loop/2 + 64)
CP: 0x0000000000000000 (invalid)
arity = 0
0xfffffd7ff3a862b0 Return addr 0x00000000030afec8 (proc_lib:init_p_do_apply/3 + 56)
y(0) []
y(1) {state,<0.40.0>,{appl_data,lager,[],undefined,{lager_app,[]},[error_logger_lager_h,lager,lager_app,lager_console_backend,lager_crash_log,lager_file_backend,lager_handler_watcher,lager_handler_watcher_sup,lager_mochiglobal,lager_stdlib,lager_sup,lager_transform,lager_trunc_io,lager_util],[],infinity,infinity},[],0,<0.32.0>}
y(2) <0.6.0>
0xfffffd7ff3a862d0 Return addr 0x00000000006df368 (<terminate process normally>)
y(0) Catch 0x00000000030afee8 (proc_lib:init_p_do_apply/3 + 88)
ok
(media-riak@prod-2000-internal.XXXer.com)13> x.
x
(media-riak@prod-2000-internal.XXXer.com)14>
(media-riak@prod-2000-internal.XXXer.com)14>
(media-riak@prod-2000-internal.XXXer.com)14>
(media-riak@prod-2000-internal.XXXer.com)14>
(media-riak@prod-2000-internal.XXXer.com)14>
(media-riak@prod-2000-internal.XXXer.com)14>
(media-riak@prod-2000-internal.XXXer.com)14> process_info(pid(0,89,0)).
[{registered_name,riak_sysmon_handler},
{current_function,{gen,do_call,4}},
{initial_call,{proc_lib,init_p,5}},
{status,waiting},
{message_queue_len,305},
{messages,[{notify,{monitor,<0.13317.112>,long_gc,
[{timeout,580},
{old_heap_block_size,0},
{heap_block_size,75025},
{mbuf_size,0},
{stack_size,33},
{old_heap_size,0},
{heap_size,25272}]}},
{notify,{monitor,<0.26837.116>,long_gc,
[{timeout,119},
{old_heap_block_size,0},
{heap_block_size,46368},
{mbuf_size,0},
{stack_size,78},
{old_heap_size,0},
{heap_size,14577}]}},
{notify,{monitor,<0.14736.112>,long_gc,
[{timeout,121},
{old_heap_block_size,0},
{heap_block_size,46368},
{mbuf_size,0},
{stack_size,78},
{old_heap_size,0},
{heap_size,14575}]}},
{notify,{monitor,<0.7202.137>,long_gc,
[{timeout,266},
{old_heap_block_size,0},
{heap_block_size,4181},
{mbuf_size,0},
{stack_size,29},
{old_heap_size,0},
{heap_size,1448}]}},
{notify,{monitor,<0.32645.111>,long_gc,
[{timeout,533},
{old_heap_block_size,0},
{heap_block_size,46368},
{mbuf_size,0},
{stack_size,30},
{old_heap_size,0},
{heap_size,14501}]}},
{notify,{monitor,<0.14405.111>,long_gc,
[{timeout,595},
{old_heap_block_size,0},
{heap_block_size,46368},
{mbuf_size,0},
{stack_size,82},
{old_heap_size,0},
{heap_size,14489}]}},
{notify,{monitor,<0.107.0>,long_gc,
[{timeout,728},
{old_heap_block_size,0},
{heap_block_size,75025},
{mbuf_size,0},
{stack_size,9},
{old_heap_size,0},
{heap_size,...}]}},
{notify,{monitor,<0.14353.111>,long_gc,
[{timeout,603},
{old_heap_block_size,0},
{heap_block_size,46368},
{mbuf_size,0},
{stack_size,30},
{old_heap_size,...},
{...}]}},
{notify,{monitor,<0.14164.139>,long_gc,
[{timeout,466},
{old_heap_block_size,0},
{heap_block_size,46368},
{mbuf_size,0},
{stack_size,...},
{...}|...]}},
{notify,{monitor,<0.27900.116>,long_gc,
[{timeout,599},
{old_heap_block_size,0},
{heap_block_size,46368},
{mbuf_size,...},
{...}|...]}},
{notify,{monitor,<0.13157.111>,long_gc,
[{timeout,599},
{old_heap_block_size,0},
{heap_block_size,...},
{...}|...]}},
{notify,{monitor,<0.25009.111>,long_gc,
[{timeout,599},{old_heap_block_size,...},{...}|...]}},
{notify,{monitor,<0.14045.112>,long_gc,
[{timeout,...},{...}|...]}},
{notify,{monitor,<0.25009.111>,long_gc,[{...}|...]}},
{notify,{monitor,<0.5408.112>,long_gc,[...]}},
{notify,{monitor,<0.32645.111>,long_gc,...}},
{notify,{monitor,<0.25144.111>,...}},
{notify,{monitor,...}},
{notify,{...}},
{notify,...},
{...}|...]},
{links,[<0.87.0>,<0.94.0>]},
{dictionary,[{'$ancestors',[riak_sysmon_sup,<0.86.0>]},
{'$initial_call',{gen_event,init_it,6}}]},
{trap_exit,true},
{error_handler,error_handler},
{priority,normal},
{group_leader,<0.85.0>},
{total_heap_size,17711},
{heap_size,17711},
{stack_size,34},
{reductions,112210},
{garbage_collection,[{min_bin_vheap_size,46368},
{min_heap_size,233},
{fullsweep_after,0},
{minor_gcs,0}]},
{suspending,[]}]
(media-riak@prod-2000-internal.XXXer.com)15> io:format("~s\n", [element(2,process_info(pid(0,89,0), backtrace))]).
Program counter: 0x00000000030aa5a0 (gen:do_call/4 + 576)
CP: 0x0000000000000000 (invalid)
arity = 0
0xfffffd7ff86e2b40 Return addr 0x00000000030b6a18 (gen_event:rpc/2 + 96)
y(0) #Ref<0.0.24.65121>
y(1) 'media-riak@prod-2000-internal.XXXer.com'
y(2) []
y(3) infinity
y(4) {sync_notify,{log,6,{["2012",45,["0",49],45,"19"],[["0",49],58,"41",58,"18",46,"933"]},[["[","info","] "],["<0.89.0>",64,"riak_core_sysmon_handler",58,"handle_event",58,"85",32],"monitor long_gc <0.568.0> [{name,riak_kv_put_fsm_sup},{initial_call,{supervisor,riak_kv_put_fsm_sup,1}},{almost_current_function,{proc_lib,sync_wait,2}}] [{timeout,578},{old_heap_block_size,0},{heap_block_size,987},{mbuf_size,0},{stack_size,28},{old_heap_size,0},{heap_size,395}]"]}}
y(5) <0.89.0>
y(6) <0.43.0>
0xfffffd7ff86e2b80 Return addr 0x00000000048114f8 (riak_core_sysmon_handler:handle_event/2 + 648)
0xfffffd7ff86e2b88 Return addr 0x00000000030bab08 (gen_event:server_update/4 + 272)
y(0) []
y(1) []
y(2) []
y(3) []
y(4) []
y(5) {state}
0xfffffd7ff86e2bc0 Return addr 0x00000000030ba6f0 (gen_event:server_notify/4 + 136)
y(0) riak_sysmon_handler
y(1) {monitor,<0.568.0>,long_gc,[{timeout,578},{old_heap_block_size,0},{heap_block_size,987},{mbuf_size,0},{stack_size,28},{old_heap_size,0},{heap_size,395}]}
y(2) {handler,riak_core_sysmon_handler,false,{state},<0.94.0>}
y(3) {state}
y(4) riak_core_sysmon_handler
y(5) Catch 0x00000000030bab08 (gen_event:server_update/4 + 272)
0xfffffd7ff86e2bf8 Return addr 0x00000000030b75e8 (gen_event:handle_msg/5 + 256)
y(0) riak_sysmon_handler
y(1) handle_event
y(2) {monitor,<0.568.0>,long_gc,[{timeout,578},{old_heap_block_size,0},{heap_block_size,987},{mbuf_size,0},{stack_size,28},{old_heap_size,0},{heap_size,395}]}
y(3) []
0xfffffd7ff86e2c20 Return addr 0x00000000030afec8 (proc_lib:init_p_do_apply/3 + 56)
y(0) []
y(1) riak_sysmon_handler
y(2) <0.87.0>
0xfffffd7ff86e2c40 Return addr 0x00000000006df368 (<terminate process normally>)
y(0) Catch 0x00000000030afee8 (proc_lib:init_p_do_apply/3 + 88)
ok
(media-riak@prod-2000-internal.XXXer.com)16>
** See my yellow notepad for more notes, but...
... erlang.log.4 shows that Greg (?) used the console successfully
prior to the console going dead.
(media-riak@prod-2000-internal.XXXer.com)5>
rp(lists:keysort(2, [{N, element(1, rpc:call(N, erlang, statistics, [wall_clock]))} || N <- [node() | nodes()]])).
[{'media-riak@prod-2000-internal.XXXer.com',6519722},
{'media-riak@prod-2021-internal.XXXer.com',7155611},
{'media-riak@prod-2020-internal.XXXer.com',7183842},
{'media-riak@prod-2022-internal.XXXer.com',7187230},
{'media-riak@prod-2019-internal.XXXer.com',7195017},
{'media-riak@prod-2023-internal.XXXer.com',7232344},
{'media-riak@prod-2018-internal.XXXer.com',7246924},
{'media-riak@prod-2017-internal.XXXer.com',7256024},
{'media-riak@prod-2016-internal.XXXer.com',7279753},
{'media-riak@prod-2015-internal.XXXer.com',7297350},
{'media-riak@prod-2024-internal.XXXer.com',7299218},
{'media-riak@prod-2014-internal.XXXer.com',7369359},
{'media-riak@prod-2013-internal.XXXer.com',7378016},
{'media-riak@prod-2025-internal.XXXer.com',7402411},
{'media-riak@prod-2012-internal.XXXer.com',7436498},
{'media-riak@prod-2011-internal.XXXer.com',7455697},
{'media-riak@prod-2026-internal.XXXer.com',7493179},
{'media-riak@prod-2009-internal.XXXer.com',7498511},
{'media-riak@prod-2010-internal.XXXer.com',7511964},
{'media-riak@prod-2008-internal.XXXer.com',7558773},
{'media-riak@prod-2007-internal.XXXer.com',7576800},
{'media-riak@prod-2006-internal.XXXer.com',7606920},
{'media-riak@prod-2005-internal.XXXer.com',7623072},
{'media-riak@prod-2027-internal.XXXer.com',7646077},
{'media-riak@prod-2004-internal.XXXer.com',7656642},
{'media-riak@prod-2028-internal.XXXer.com',7678676},
{'media-riak@prod-2002-internal.XXXer.com',7683218},
{'media-riak@prod-2003-internal.XXXer.com',7715771},
{'media-riak@prod-2029-internal.XXXer.com',7767783},
{'media-riak@prod-2001-internal.XXXer.com',7788377},
{'media-riak@prod-2030-internal.XXXer.com',7806678},
{'media-riak@prod-2031-internal.XXXer.com',7833097},
{'media-riak@prod-2032-internal.XXXer.com',7872286},
{'media-riak@prod-2033-internal.XXXer.com',7888006},
{'media-riak@prod-2034-internal.XXXer.com',7923088},
{'media-riak@prod-2035-internal.XXXer.com',7945616},
{'media-riak@prod-2036-internal.XXXer.com',7962402},
{'media-riak@prod-2037-internal.XXXer.com',7993250},
{'media-riak@prod-2038-internal.XXXer.com',8088754},
{'media-riak@prod-2039-internal.XXXer.com',8127035},
{'media-riak@prod-2040-internal.XXXer.com',8200300}]
ok
(media-riak@prod-2000-internal.XXXer.com)6>
===== ALIVE Thu Jan 19 01:47:12 UTC 2012
===== ALIVE Thu Jan 19 02:02:12 UTC 2012
===== ALIVE Thu Jan 19 02:17:12 UTC 2012
===== ALIVE Thu Jan 19 02:32:12 UTC 2012
[SLF: many more 15-minute-interval messages omitted...]
===== ALIVE Thu Jan 19 06:47:13 UTC 2012
===== ALIVE Thu Jan 19 07:02:13 UTC 2012
^R
x
y
asdflkjasdflkjasdf
^R
^G^M
BREAK: (a)bort (c)ontinue (p)roc info (i)nfo (l)oaded^M
(v)ersion (k)ill (D)b-tables (d)istribution^M
c
Eh?^M
^M
c
salkasdflkj
asdflkjasdf
asdflkjqwer
oiquwer
qwoeriua
asdflkj
qweoriu
-- end of log file --
Also, I didn't think much of it at the time (other than
"Hm, that seems a little bit high"), but beam.smp's CPU utilization was
consistently high at that time:
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
11504 riak 9046M 8893M run 59 0 25:00:10 32% beam.smp/23
... and ...
Total: 56 processes, 287 lwps, load averages: 5.11, 4.91, 4.97
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
11504 riak 48 2.9 0.0 0.0 0.0 34 15 0.1 19K 2 24K 0 beam.smp/20
11504 riak 46 2.2 0.0 0.0 0.0 44 7.5 0.1 14K 15 17K 0 beam.smp/14
11504 riak 40 2.3 0.0 0.0 0.0 51 6.1 0.1 16K 7 19K 0 beam.smp/19
11504 riak 38 3.1 0.0 0.0 0.0 50 8.4 0.2 24K 5 28K 0 beam.smp/8
11504 riak 38 1.9 0.0 0.0 0.0 52 7.6 0.2 14K 7 17K 0 beam.smp/13
11504 riak 35 1.6 0.0 0.0 0.0 60 2.8 0.1 9K 1 11K 0 beam.smp/16
11504 riak 33 2.7 0.0 0.0 0.0 51 13 0.1 18K 6 22K 0 beam.smp/7
11504 riak 30 1.3 0.0 0.0 0.0 59 10 0.1 9K 4 12K 0 beam.smp/18
11504 riak 27 1.2 0.0 0.0 0.0 69 2.6 0.0 7K 3 8K 0 beam.smp/15
11504 riak 23 1.6 0.0 0.0 0.0 71 4.0 0.2 10K 6 11K 0 beam.smp/6
11504 riak 19 1.2 0.0 0.0 0.0 75 3.1 1.0 7K 3 9K 0 beam.smp/11
11504 riak 15 0.6 0.0 0.0 0.0 81 3.3 0.0 4K 1 4K 0 beam.smp/12
11504 riak 10 0.5 0.0 0.0 0.0 86 2.6 0.0 3K 3 4K 0 beam.smp/9
11504 riak 4.7 0.2 0.0 0.0 0.0 95 0.3 0.0 1K 2 1K 0 beam.smp/17
11504 riak 2.2 0.2 0.0 0.0 0.0 97 0.5 0.0 1K 2 1K 0 beam.smp/21
I did not use "truss" or DTrace to try to see what it was doing at
that time, alas.
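In hindsight, a cheap in-VM substitute would have been to rank processes by
reductions burned over a short window. A sketch (not from the original
session), using only process_info/2, timer:sleep/1, and dict:
%% Sample reductions twice, SleepMs apart; report the N biggest burners.
TopRed = fun(SleepMs, N) ->
    Snap = fun() -> [{P, R} || P <- processes(),
                               {reductions, R} <- [process_info(P, reductions)]]
           end,
    Before = dict:from_list(Snap()),
    timer:sleep(SleepMs),
    Deltas = [{R2 - R1, P} || {P, R2} <- Snap(),
                              {ok, R1} <- [dict:find(P, Before)]],
    lists:sublist(lists:reverse(lists:sort(Deltas)), N)
end.
TopRed(5000, 10).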
So, when I restarted riak on prod-2000 via kill {pid} ("riak stop"
didn't work), a lot of handoffs were triggered. After a few minutes,
all outstanding transfers were finished and none remained
cluster-wide.
So, prod-2000 was in an odd state where "erl -remsh prod-2000..." would
work, but "riak attach" did not, and apparently its vnodes weren't
working correctly? Odd.
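For next time, a quick remote sanity check might look like this (a sketch,
not from the log; riak_core_node_watcher:services/1 is assumed to exist in
this Riak release):
Node = 'media-riak@prod-2000-internal.XXXer.com'.
rpc:call(Node, riak_core_node_watcher, services, [Node]).  %% which services does the node itself think are up?
rpc:call(Node, erlang, statistics, [run_queue]).           %% scheduler run-queue length, a rough busyness hint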
----
Find all current transfers:
rp(lists:flatten(element(1,(rpc:multicall(lists:sort([node() | nodes()]), ets, tab2list, [global_locks]))))).
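A slower but easier-to-read variant (a sketch, not from the log) calls each
node in turn so the lock counts line up with node names; a down node shows
up as a caught error rather than crashing the whole report:
Nodes = lists:sort([node() | nodes()]).
rp([{N, (catch length(rpc:call(N, ets, tab2list, [global_locks])))} || N <- Nodes]).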
----
Pids = [pid(0,31950,667), pid(0,31204,667), pid(0,3869,668)].
[begin
{_, M} = process_info(P, messages),
lists:sort([element(2, element(4, element(2, GE))) || GE <- M, element(1, element(4, element(2, GE))) == riak_kv_get_req_v1])
end || P <- Pids].
[[{<<"bodies">>,<<"1323062665794_1623433739">>},
{<<"bodies">>,<<"1323062665794_1623433739">>},
{<<"bodies">>,<<"1323291523887_1261585781">>},
{<<"bodies">>,<<"1323291523887_1261585781">>},
{<<"bodies">>,<<"1323749748395_1663680247">>},
{<<"bodies">>,<<"1323749748395_1663680247">>},
{<<"bodies">>,
<<"1325103665235_a6d971e73bc93292_11ef8a9f712ba82f">>},
{<<"bodies">>,
<<"1325103665235_a6d971e73bc93292_11ef8a9f712ba82f">>},
{<<"bodies">>,<<"1325181319905_0223540070">>},
{<<"bodies">>,<<"1325292430028_726416307">>},
{<<"bodies">>,<<"1325292430028_726416307">>},
{<<"bodies">>,
<<"1325572004610_ad09ab45a4e270ab_305d714fc0460266">>},
{<<"bodies">>,
<<"1325572004610_ad09ab45a4e270ab_305d714fc0460266">>},
{<<"bodies">>,<<"1325626956996_1424663528">>},
{<<"bodies">>,<<"1325626956996_1424663528">>},
{<<"bodies">>,<<"1325695381884_1709701352">>},
{<<"bodies">>,<<"1325695381884_1709701352">>},
{<<"bodies">>,<<"1326132704350_a4fae6cc359908b1_a"...>>},
{<<"bodies">>,<<"1326231759930_144d4bfd752838"...>>},
{<<"bodies">>,<<"1326494084453_5114271274">>},
{<<"bodies">>,<<"1326612744480_193742"...>>},
{<<"bodies">>,<<"1326612744480_19"...>>},
{<<"bodies">>,<<"132661274448"...>>},
{<<"bodies">>,<<"13266127"...>>},
{<<"bodies">>,<<"1326"...>>},
{<<"bodi"...>>,<<...>>},
{<<...>>,...},
{...}|...],
[{<<"bodies">>,<<"1323897305170_6801318432">>},
{<<"bodies">>,<<"1323897305170_6801318432">>},
{<<"bodies">>,<<"1327040954254_0443391082_3a2190a">>},
{<<"bodies">>,<<"1327277886646_7864500163_edc0f306">>},
{<<"bodies">>,<<"1327295682946_4578250107_454b0b52">>},
{<<"bodies">>,<<"1327311363065_8737892066_5ae266b1">>},
{<<"bodies">>,<<"1327332554454_8735578690">>},
{<<"bodies">>,<<"1327355380079_4177460385_5c04edbb">>},
{<<"bodies">>,<<"1327388143845_0554939193_552f81b">>},
{<<"bodies">>,<<"1327432439439_5035365936_10c70a78">>},
{<<"bodies">>,<<"1327432439439_5035365936_10c70a78">>},
{<<"bodies">>,<<"1327432439439_5035365936_10c70a78">>},
{<<"bodies">>,<<"1327432439439_5035365936_10c70a78">>},
{<<"bodies">>,
<<"1327437260171_96c2d749a6e2d109_e2d87f6b66007"...>>},
{<<"bodies">>,
<<"1327437260171_96c2d749a6e2d109_e2d87f6b6"...>>},
{<<"bodies">>,<<"1327437260171_96c2d749a6e2d109_e2d87"...>>},
{<<"bodies">>,<<"1327437260171_96c2d749a6e2d109_e"...>>},
{<<"bodies">>,<<"1327437260171_96c2d749a6e2d1"...>>},
{<<"bodies">>,<<"1327437260171_96c2d749a6"...>>},
{<<"bodies">>,<<"1327437260171_96c2d7"...>>},
{<<"bodies">>,<<"1327437260171_96"...>>},
{<<"bodies">>,<<"132745260716"...>>},
{<<"bodies">>,<<"13274541"...>>},
{<<"bodies">>,<<"1327"...>>},
{<<"bodi"...>>,<<...>>},
{<<...>>,...},
{...}|...],
[]]
(media-riak@prod-2015-internal.XXXer.com)17> [length(LL) || LL <- v(16)].
[28050,7425,0]
(media-riak@prod-2015-internal.XXXer.com)19> file:write_file("/tmp/odd-vnode-backlog.keys.bin", term_to_binary(v(61))).
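To read that dump back in later (a sketch, not from the log):
{ok, Bin} = file:read_file("/tmp/odd-vnode-backlog.keys.bin").
BKeys = binary_to_term(Bin).   %% whatever term was dumped (presumably the backlog key lists above)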
[begin
{_, M} = process_info(P, messages),
length(lists:sort([element(2, element(4, element(2, GE))) || GE <- M, element(1, element(4, element(2, GE))) == riak_kv_get_req_v1]))
end || P <- Pids].
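The same element/2 digging, restated with comments (a sketch; it assumes the
message layout implied above: each mailbox entry wraps a vnode request
record whose 4th element is the embedded KV request, tagged
riak_kv_get_req_v1 with the {Bucket, Key} pair as its 2nd field):
BKeysInQueue = fun(P) ->
    {_, Msgs} = process_info(P, messages),
    [element(2, element(4, element(2, M)))      %% the {Bucket, Key} of the get request
     || M <- Msgs,
        element(1, element(4, element(2, M))) == riak_kv_get_req_v1]
end.
[length(BKeysInQueue(P)) || P <- Pids].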
----
dtrace -n 'profile-997 /execname == "beam.smp" && arg1/ { @[uaddr(arg1)] = count(); } tick-5s { trunc(@, 10); exit(0); }'
dtrace -q -n 'pid$target:eleveldb.so::entry { self->level = 1; } pid$target:eleveldb.so::return { self->level = 0 } syscall::open:entry /self->level > 0/ { printf("%s\n", copyinstr(arg0)); }' -p `pgrep -u riak beam.smp`
dtrace -q -n 'pid$target:eleveldb.so:eleveldb_*:entry { self->level = 1; } pid$target:eleveldb.so:eleveldb_*:return { self->level = 0 } syscall::open:entry /self->level > 0/ { printf("%s\n", copyinstr(arg0)); } tick-5sec { exit(0);}' -p `pgrep -u riak beam.smp` | awk -F/ '{print $6}' | sort | uniq -c | sort -nr | head -20
dtrace -n 'pid$target:eleveldb.so:*DoCompactionWork*:entry' -p `pgrep -u riak beam.smp`
dtrace -n 'pid$target:eleveldb.so:*OpenCompactionOutputFile*:entry' -p `pgrep -u riak beam.smp`
dtrace -q -n 'pid$target:eleveldb.so:*OpenCompactionOutputFile*:entry { self->level = 1; } pid$target:eleveldb.so:*OpenCompactionOutputFile*:return { self->level = 0 } syscall::open:entry /self->level > 0/ { printf("%s\n", copyinstr(arg0)); } tick-5sec { exit(0);}' -p `pgrep -u riak beam.smp`
/var/db/riak/leveldb/874902835628753407088729299272540518602623680512/352337.sst
/var/db/riak/leveldb/874902835628753407088729299272540518602623680512/352338.sst
/var/db/riak/leveldb/874902835628753407088729299272540518602623680512/352339.sst
[...]
dtrace -n 'syscall::*read*:entry { @[execname, lwp] = count} tick-5sec { exit(0);}'
----
repair action on prod-2003, 2014, 2015, and 2019
repair action on prod-2006
For example, on 2014:
[root@prod-2014 /var/db/riak/leveldb]# ls -lt | head
total 52123
drwxr-xr-x 2 riak riak 62692 2012-01-27 02:00 984800907967112317930217318920151644104095170560
drwxr-xr-x 2 riak riak 68096 2012-01-27 02:00 725041827894627619577609272480343529282435284992
drwxr-xr-x 2 riak riak 18 2012-01-27 01:28 943410724878639481379527025806116285148995518464
drwxr-xr-x 2 riak riak 17 2012-01-27 01:28 646543204795799826119403544160621296781384220672
[...]
So, two hosed dirs.
Verify by running "grep error: $vnode_dir/LOG";
there should be at least one match, probably thousands.
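If you already have a shell on the node, the same check can be done from
Erlang (a sketch, not from the log), scanning each vnode dir's LOG for
error lines:
Base = "/var/db/riak/leveldb".
{ok, Dirs} = file:list_dir(Base).
Bad = [D || D <- Dirs,
            {ok, Log} <- [file:read_file(filename:join([Base, D, "LOG"]))],
            re:run(Log, "error", [caseless]) =/= nomatch].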
cd /var/db/riak/leveldb
erl -pz /tmp/eleveldb+repair/ebin
At shell prompt, e.g.,
eleveldb:repair("./984800907967112317930217318920151644104095170560", []).
Then, there's a second step, which takes a long long long time also.
1> {ok, DB} = eleveldb:open("./1455792646560079078679451688838485039110401556480", [write]).
{ok,<<>>}
Watch "iostat -x 1" and the 1455*/LOG file for compaction activity.
If compaction activity doesn't happen, or stops after only a few
minutes, then use the following (perhaps alternating put & delete
several times) to trigger enough writes to restart compaction:
10> [eleveldb:put(DB, <<X:32>>, <<"asdlfkjasdlfkjasdlkfjasdfasdlfkjasdlasdlfkjasdlasdlfkjasdlasdlfkjasdlasdlfkjasdl">>, []) || X <- lists:seq(1,20000)].
[ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,
ok,ok,ok,ok,ok,ok,ok,ok,ok,ok|...]
11> [eleveldb:delete(DB, <<X:32>>, []) || X <- lists:seq(1,20000)].
[ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,ok,
ok,ok,ok,ok,ok,ok,ok,ok,ok,ok|...]
When compaction has finally finished (no "iostat -x" activity, and no
activity in the db's LOG file), then:
12> q().
ok
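A sketch (not from the log) that rolls the per-directory repair + churn
recipe above into one shell fun, using only the eleveldb calls already shown
(repair/2, open/2, put/4, delete/3); watching iostat and the LOG file for
compaction to finish is still a manual step, and the directory name below is
a placeholder:
RepairAndChurn = fun(Dir) ->
    io:format("repair ~s -> ~p~n", [Dir, eleveldb:repair(Dir, [])]),
    {ok, DB} = eleveldb:open(Dir, [write]),
    Blob = <<0:(80*8)>>,                  %% 80-byte filler value, like the string above
    [eleveldb:put(DB, <<X:32>>, Blob, []) || X <- lists:seq(1, 20000)],
    [eleveldb:delete(DB, <<X:32>>, []) || X <- lists:seq(1, 20000)],
    ok
end.
RepairAndChurn("./<hosed_vnode_dir>").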
----
dtrace -qn 'pid$target:eleveldb.so:*DoCompactionWork*:entry { self->start = timestamp; } pid$target:eleveldb.so:*DoCompactionWork*:return /self->start > 0/ { @ = quantize((timestamp - self->start) / 1000000); self->start = 0; } END {printf("DoCompactionWork in milliseconds\n"); printa(@); }' -p `pgrep -u riak beam.smp`
dtrace -qn 'pid$target:eleveldb.so:eleveldb_*:entry { self->start = timestamp; } pid$target:eleveldb.so:eleveldb_*:return /self->start > 0/ { @[probefunc] = quantize((timestamp - self->start) / 1000000); self->start = 0; } END {printf("NIF function calls in milliseconds\n"); printa(@); } tick-300sec { exit(0); }' -p `pgrep -u riak beam.smp` ; date
---
Sunday 2012-01-29
Interesting, a new way for riak_sysmon_filter to get hosed.
(media-riak@prod-2101-internal.XXXer.com)10> process_info(pid(0,91,0)).
[{registered_name,riak_sysmon_filter},
{current_function,{net_kernel,get_status,3}},
{initial_call,{proc_lib,init_p,5}},
{status,waiting},
{message_queue_len,33004},
{messages,[reset,reset,reset,reset,reset,reset,reset,reset,
reset,reset,reset,reset,reset,reset,reset,reset,reset,reset,
reset,reset,reset|...]},
{links,[<0.79.0>,<0.90.0>]},
{dictionary,[{'$ancestors',[riak_sysmon_sup,<0.89.0>]},
{'$initial_call',{riak_sysmon_filter,init,1}}]},
{trap_exit,false},
{error_handler,error_handler},
{priority,normal},
{group_leader,<0.88.0>},
{total_heap_size,2584},
{heap_size,2584},
{stack_size,27},
{reductions,118673663},
{garbage_collection,[{min_bin_vheap_size,46368},
{min_heap_size,233},
{fullsweep_after,0},
{minor_gcs,0}]},
{suspending,[]}]
(media-riak@prod-2101-internal.XXXer.com)11> io:format("~s\n", [element(2,process_info(pid(0,91,0), backtrace))]).
Program counter: 0x0000000004435cf0 (net_kernel:get_status/3 + 216)
CP: 0x0000000000000000 (invalid)
arity = 0
0x0000000000e0dc48 Return addr 0x0000000004435a38 (net_kernel:get_node_info/1 + 424)
y(0) 'media-riak@prod-2038-internal.XXXer.com'
y(1) <0.10111.2>
0x0000000000e0dc60 Return addr 0x0000000004436180 (net_kernel:get_nodes_info/2 + 96)
y(0) normal
y(1) {net_address,{{192,168,25,113},60765},"prod-2038-internal.XXXer.com",tcp,inet}
y(2) <0.10111.2>
y(3) up
0x0000000000e0dc88 Return addr 0x000000000487bed0 (riak_sysmon_filter:get_node_map/0 + 64)
y(0) [{'media-riak@prod-2098-internal.XXXer.com',[{owner,<0.195.0>},{state,up},{address,{net_address,{{192,168,25,237},50572},"prod-2098-internal.XXXer.com",tcp,inet}},{type,normal},{in,16612},{out,24082}]},{'media-riak@prod-2012-internal.XXXer.com',[{owner,<0.10202.2>},{state,up},{address,{net_address,{{192,168,25,99},40263},"prod-2012-internal.XXXer.com",tcp,inet}},{type,normal},{in,14},{out,5}]},{'media-riak@prod-2032-internal.XXXer.com',[{owner,<0.10287.2>},{state,pending},{address,{net_address,{{192,168,25,20},62879},"prod-2032-internal.XXXer.com",tcp,inet}},{type,normal},{in,0},{out,0}]},{'media-riak@prod-2002-internal.XXXer.com',[{owner,<0.211.0>},{state,up},{address,{net_address,{{192,168,25,18},50827},"prod-2002-internal.XXXer.com",tcp,inet}},{type,normal},{in,25453},{out,24814}]},{'media-riak@prod-2035-internal.XXXer.com',[{owner,<0.10191.2>},{state,up},{address,{net_address,{{192,168,25,68},37453},"prod-2035-internal.XXXer.com",tcp,inet}},{type,normal},{in,10},{out,5}]},{'media-riak@prod-2015-internal.XXXer.com',[{owner,<0.10185.2>},{state,up},{address,{net_address,{{192,168,25,102},57790},"prod-2015-internal.XXXer.com",tcp,inet}},{type,normal},{in,11},{out,4}]},{'media-riak@prod-2034-internal.XXXer.com',[{owner,<0.10189.2>},{state,up},{address,{net_address,{{192,168,25,59},45894},"prod-2034-internal.XXXer.com",tcp,inet}},{type,normal},{in,11},{out,6}]},{'media-riak@prod-2104-internal.XXXer.com',[{owner,<0.9036.2>},{state,up},{address,{net_address,{{192,168,26,7},35922},"prod-2104-internal.XXXer.com",tcp,inet}},{type,normal},{in,64},{out,70}]},{'media-riak@prod-2017-internal.XXXer.com',[{owner,<0.10181.2>},{state,up},{address,{net_address,{{192,168,25,104},34530},"prod-2017-internal.XXXer.com",tcp,inet}},{type,normal},{in,5},{out,4}]},{'media-riak@prod-2037-internal.XXXer.com',[{owner,<0.10094.2>},{state,up},{address,{net_address,{{192,168,25,112},51065},"prod-2037-internal.XXXer.com",tcp,inet}},{type,normal},{in,15},{out,8}]}]
y(1) ['media-riak@prod-2108-internal.XXXer.com','media-riak@prod-2028-internal.XXXer.com','media-riak@prod-2029-internal.XXXer.com','media-riak@prod-2039-internal.XXXer.com','media-riak@prod-2036-internal.XXXer.com','media-riak@prod-2106-internal.XXXer.com','media-riak@prod-2006-internal.XXXer.com','media-riak@prod-2026-internal.XXXer.com','media-riak@prod-2016-internal.XXXer.com','media-riak@prod-2023-internal.XXXer.com','media-riak@prod-2013-internal.XXXer.com','media-riak@prod-2103-internal.XXXer.com','media-riak@prod-2033-internal.XXXer.com','media-riak@prod-2003-internal.XXXer.com','media-riak@prod-2010-internal.XXXer.com','media-riak@prod-2040-internal.XXXer.com','media-riak@prod-2100-internal.XXXer.com','media-riak@prod-2020-internal.XXXer.com','media-riak@prod-2030-internal.XXXer.com','media-riak@prod-2021-internal.XXXer.com','media-riak@prod-2001-internal.XXXer.com','media-riak@prod-2031-internal.XXXer.com','media-riak@prod-2011-internal.XXXer.com']
y(2) 'media-riak@prod-2038-internal.XXXer.com'
0x0000000000e0dca8 Return addr 0x000000000487b560 (riak_sysmon_filter:handle_info/2 + 2600)
y(0) Catch 0x000000000487bf48 (riak_sysmon_filter:get_node_map/0 + 184)
0x0000000000e0dcb8 Return addr 0x00000000043c5cc8 (gen_server:handle_msg/5 + 1680)
y(0) []
y(1) {0,nil}
y(2) {state,0,30,0,2,{0,nil},[{{{192,168,25,98},46335},'media-riak@prod-2011-internal.XXXer.com'},{{{192,168,24,169},48853},'media-riak@prod-2031-internal.XXXer.com'},{{{192,168,25,4},45159},'media-riak@prod-2001-internal.XXXer.com'},{{{192,168,24,123},50343},'media-riak@prod-2021-internal.XXXer.com'},{{{192,168,24,168},47264},'media-riak@prod-2030-internal.XXXer.com'},{{{192,168,24,121},59550},'media-riak@prod-2020-internal.XXXer.com'},{{{192,168,25,248},39400},'media-riak@prod-2100-internal.XXXer.com'},{{{192,168,24,233},57098},'media-riak@prod-2000-internal.XXXer.com'},{{{192,168,25,115},34939},'media-riak@prod-2040-internal.XXXer.com'},{{{192,168,25,97},51083},'media-riak@prod-2010-internal.XXXer.com'},{{{192,168,25,41},38256},'media-riak@prod-2033-internal.XXXer.com'},{{{192,168,25,250},64683},'media-riak@prod-2103-internal.XXXer.com'},{{{192,168,25,100},33050},'media-riak@prod-2013-internal.XXXer.com'},{{{192,168,24,125},55403},'media-riak@prod-2023-internal.XXXer.com'},{{{192,168,25,103},38466},'media-riak@prod-2016-internal.XXXer.com'},{{{192,168,24,127},41757},'media-riak@prod-2026-internal.XXXer.com'},{{{192,168,25,90},45461},'media-riak@prod-2006-internal.XXXer.com'},{{{192,168,26,11},34804},'media-riak@prod-2106-internal.XXXer.com'},{{{192,168,25,87},41709},'media-riak@prod-2036-internal.XXXer.com'},{{{192,168,25,96},43607},'media-riak@prod-2009-internal.XXXer.com'},{{{192,168,25,240},45068},'media-riak@prod-2099-internal.XXXer.com'},{{{192,168,25,106},61231},'media-riak@prod-2019-internal.XXXer.com'},{{{192,168,25,114},58133},'media-riak@prod-2039-internal.XXXer.com'},{{{192,168,24,166},63433},'media-riak@prod-2029-internal.XXXer.com'},{{{192,168,25,105},62116},'media-riak@prod-2018-internal.XXXer.com'},{{{192,168,25,111},48904},'media-riak@prod-2028-internal.XXXer.com'},{{{192,168,26,18},44053},'media-riak@prod-2108-internal.XXXer.com'},{{{192,168,25,113},60765},'media-riak@prod-2038-internal.XXXer.com'},{{{192,168,25,95},56026},'media-riak@prod-2008-internal.XXXer.com'},{{{192,168,25,237},50572},'media-riak@prod-2098-internal.XXXer.com'},{{{192,168,25,99},40263},'media-riak@prod-2012-internal.XXXer.com'},{{{192,168,25,20},46256},'media-riak@prod-2032-internal.XXXer.com'},{{{192,168,25,249},54807},'media-riak@prod-2102-internal.XXXer.com'},{{{192,168,24,124},64396},'media-riak@prod-2022-internal.XXXer.com'},{{{192,168,25,18},50827},'media-riak@prod-2002-internal.XXXer.com'},{{{192,168,25,68},37453},'media-riak@prod-2035-internal.XXXer.com'},{{{192,168,25,89},65223},'media-riak@prod-2005-internal.XXXer.com'},{{{192,168,25,102},57790},'media-riak@prod-2015-internal.XXXer.com'},{{{192,168,24,199},47972},'media-riak@prod-2025-internal.XXXer.com'},{{{192,168,26,8},55805},'media-riak@prod-2105-internal.XXXer.com'},{{{192,168,25,101},44535},'media-riak@prod-2014-internal.XXXer.com'},{{{192,168,25,59},45894},'media-riak@prod-2034-internal.XXXer.com'},{{{192,168,24,122},36804},'media-riak@prod-2024-internal.XXXer.com'},{{{192,168,25,88},33156},'media-riak@prod-2004-internal.XXXer.com'},{{{192,168,26,7},35922},'media-riak@prod-2104-internal.XXXer.com'},{{{192,168,25,91},56857},'media-riak@prod-2007-internal.XXXer.com'},{{{192,168,25,104},34530},'media-riak@prod-2017-internal.XXXer.com'},{{{192,168,26,12},42432},'media-riak@prod-2107-internal.XXXer.com'},{{{192,168,25,112},51065},'media-riak@prod-2037-internal.XXXer.com'},{{{192,168,24,208},53930},'media-riak@prod-2027-internal.XXXer.com'}],{interval,#Ref<0.0.0.185>},false}
0x0000000000e0dcd8 Return addr 0x00000000030bc6e8 (proc_lib:init_p_do_apply/3 + 56)
y(0) riak_sysmon_filter
y(1) {state,0,30,0,2,{0,nil},[{{{192,168,25,98},46335},'media-riak@prod-2011-internal.XXXer.com'},{{{192,168,24,169},48853},'media-riak@prod-2031-internal.XXXer.com'},{{{192,168,25,4},45159},'media-riak@prod-2001-internal.XXXer.com'},{{{192,168,24,123},50343},'media-riak@prod-2021-internal.XXXer.com'},{{{192,168,24,168},47264},'media-riak@prod-2030-internal.XXXer.com'},{{{192,168,24,121},59550},'media-riak@prod-2020-internal.XXXer.com'},{{{192,168,25,248},39400},'media-riak@prod-2100-internal.XXXer.com'},{{{192,168,24,233},57098},'media-riak@prod-2000-internal.XXXer.com'},{{{192,168,25,115},34939},'media-riak@prod-2040-internal.XXXer.com'},{{{192,168,25,97},51083},'media-riak@prod-2010-internal.XXXer.com'},{{{192,168,25,41},38256},'media-riak@prod-2033-internal.XXXer.com'},{{{192,168,25,250},64683},'media-riak@prod-2103-internal.XXXer.com'},{{{192,168,25,100},33050},'media-riak@prod-2013-internal.XXXer.com'},{{{192,168,24,125},55403},'media-riak@prod-2023-internal.XXXer.com'},{{{192,168,25,103},38466},'media-riak@prod-2016-internal.XXXer.com'},{{{192,168,24,127},41757},'media-riak@prod-2026-internal.XXXer.com'},{{{192,168,25,90},45461},'media-riak@prod-2006-internal.XXXer.com'},{{{192,168,26,11},34804},'media-riak@prod-2106-internal.XXXer.com'},{{{192,168,25,87},41709},'media-riak@prod-2036-internal.XXXer.com'},{{{192,168,25,96},43607},'media-riak@prod-2009-internal.XXXer.com'},{{{192,168,25,240},45068},'media-riak@prod-2099-internal.XXXer.com'},{{{192,168,25,106},61231},'media-riak@prod-2019-internal.XXXer.com'},{{{192,168,25,114},58133},'media-riak@prod-2039-internal.XXXer.com'},{{{192,168,24,166},63433},'media-riak@prod-2029-internal.XXXer.com'},{{{192,168,25,105},62116},'media-riak@prod-2018-internal.XXXer.com'},{{{192,168,25,111},48904},'media-riak@prod-2028-internal.XXXer.com'},{{{192,168,26,18},44053},'media-riak@prod-2108-internal.XXXer.com'},{{{192,168,25,113},60765},'media-riak@prod-2038-internal.XXXer.com'},{{{192,168,25,95},56026},'media-riak@prod-2008-internal.XXXer.com'},{{{192,168,25,237},50572},'media-riak@prod-2098-internal.XXXer.com'},{{{192,168,25,99},40263},'media-riak@prod-2012-internal.XXXer.com'},{{{192,168,25,20},46256},'media-riak@prod-2032-internal.XXXer.com'},{{{192,168,25,249},54807},'media-riak@prod-2102-internal.XXXer.com'},{{{192,168,24,124},64396},'media-riak@prod-2022-internal.XXXer.com'},{{{192,168,25,18},50827},'media-riak@prod-2002-internal.XXXer.com'},{{{192,168,25,68},37453},'media-riak@prod-2035-internal.XXXer.com'},{{{192,168,25,89},65223},'media-riak@prod-2005-internal.XXXer.com'},{{{192,168,25,102},57790},'media-riak@prod-2015-internal.XXXer.com'},{{{192,168,24,199},47972},'media-riak@prod-2025-internal.XXXer.com'},{{{192,168,26,8},55805},'media-riak@prod-2105-internal.XXXer.com'},{{{192,168,25,101},44535},'media-riak@prod-2014-internal.XXXer.com'},{{{192,168,25,59},45894},'media-riak@prod-2034-internal.XXXer.com'},{{{192,168,24,122},36804},'media-riak@prod-2024-internal.XXXer.com'},{{{192,168,25,88},33156},'media-riak@prod-2004-internal.XXXer.com'},{{{192,168,26,7},35922},'media-riak@prod-2104-internal.XXXer.com'},{{{192,168,25,91},56857},'media-riak@prod-2007-internal.XXXer.com'},{{{192,168,25,104},34530},'media-riak@prod-2017-internal.XXXer.com'},{{{192,168,26,12},42432},'media-riak@prod-2107-internal.XXXer.com'},{{{192,168,25,112},51065},'media-riak@prod-2037-internal.XXXer.com'},{{{192,168,24,208},53930},'media-riak@prod-2027-internal.XXXer.com'}],{interval,#Ref<0.0.0.185>},false}
y(2) riak_sysmon_filter
y(3) <0.90.0>
y(4) reset
y(5) Catch 0x00000000043c5cc8 (gen_server:handle_msg/5 + 1680)
0x0000000000e0dd10 Return addr 0x0000000000672c08 (<terminate process normally>)
y(0) Catch 0x00000000030bc708 (proc_lib:init_p_do_apply/3 + 88)
ok
---
Mon 2012-01-30: more riak_sysmon_filter mailbox stuffiness
prod-2102:
Pid = pid(0,6777,3).
process_info(Pid).
{_, M} = process_info(Pid, messages).
[X || X <- M, X /= reset, element(1, X) /= monitor].
result on 2102: []
result on 2107: []
result on 2108: []
All 3 have stack backtraces that are identical.
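A sketch (not from the log) for tallying a stuck mailbox by message shape,
so a flood like the 'reset' storm above stands out at a glance:
MailboxTally = fun(P) ->
    {_, Msgs} = process_info(P, messages),
    D = lists:foldl(fun(M, Acc) when is_tuple(M) ->
                            dict:update_counter(element(1, M), 1, Acc);
                       (M, Acc) ->
                            dict:update_counter(M, 1, Acc)
                    end, dict:new(), Msgs),
    lists:reverse(lists:keysort(2, dict:to_list(D)))
end.
MailboxTally(Pid).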
See also: Jon's notes @ https://raw.github.com/gist/67963a1342931416e552/b5200f3264db0a4c091bdfd49d679cd393d6ec87/gistfile1.txt
----
-define(CHSTATE, #chstate_v2).
-record(chstate_v2, {
nodename :: node(), % the Node responsible for this chstate
vclock :: vclock:vclock(), % for this chstate object, entries are
% {Node, Ctr}
chring :: chash:chash(), % chash ring of {IndexAsInt, Node} mappings
meta :: dict(), % dict of cluster-wide other data (primarily
% bucket N-value, etc)
clustername :: {node(), term()},
next :: [{integer(), node(), node(), [module()], awaiting | complete}],
members :: [{node(), {member_status(), vclock:vclock(), []}}],
claimant :: node(),
seen :: [{node(), vclock:vclock()}],
rvsn :: vclock:vclock()
}).
{ok, R} = riak_core_ring_manager:get_my_ring().
Changes = riak_core_ring:pending_changes(R).
Ps = [{Idx, Owner, NewOwner} || {Idx, Owner, NewOwner, Mods, _St} <- Changes].
RunDu = fun(Nd, Cmd) ->
Dunno = "-1\t",
Out = try case rpc:call(Nd, os, cmd, [Cmd]) of
L when is_list(L) -> L;
_ -> Dunno
end
catch _:_ -> Dunno
end,
case string:tokens(Out, "\t") of
[Str|_] -> list_to_integer(Str);
_ -> -1
end
end.
Space = [begin
Cmd = "du -k /var/db/riak/leveldb/" ++ integer_to_list(Idx),
OwnK = RunDu(Owner, Cmd),
NewK = RunDu(NewOwner, Cmd),
{Idx, Owner, OwnK, NewOwner, NewK}
end || {Idx, Owner, NewOwner} <- Ps].
%% sort by owner node
Space2 = lists:sort(fun(X, Y) -> element(2, X) < element(2, Y) end, Space).
%% calculate space ratios
rp([{Idx, Owner, kbyte_ratio_at_destination_PERCENT, NewK / OwnK * 100}|| {Idx, Owner, OwnK, NewOwner, NewK} <- Space2]).
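Since RunDu returns -1 when the remote du fails, a guarded variant (a
sketch) skips those entries instead of producing a bogus ratio:
rp([{Idx, Owner, kbyte_ratio_at_destination_PERCENT, NewK / OwnK * 100}
    || {Idx, Owner, OwnK, _NewOwner, NewK} <- Space2, OwnK > 0, NewK >= 0]).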
---
dtrace -qn 'BEGIN { c = 0; } pid$target:eleveldb.so:eleveldb_iterator_move:entry {self->i = 1} syscall::read:entry{ c += arg2 ;} syscall::pread:entry{ c += arg2;} tick-10sec { printf("Iterator reads @ 10 sample: %d MBytes/sec\n", c / (1024*1024)/10); c = 0; }' -p `pgrep -u riak beam.smp`
----
Alright, back to prod-2003, which has several LevelDB instances
complaining about "bad magic number" or other errors, e.g.:
[root@prod-2003 /var/db/riak/leveldb]# grep -i error */LOG | head
1043318063368056673053607043667580944695787782144/LOG:2012/01/28-21:47:27.868097 251 Compaction error: Invalid argument: not an sstable (bad magic number)
1043318063368056673053607043667580944695787782144/LOG:2012/01/28-21:48:26.053983 251 Compaction error: Invalid argument: not an sstable (bad magic number)
1043318063368056673053607043667580944695787782144/LOG:2012/01/28-21:48:39.765415 251 Compaction error: Invalid argument: not an sstable (bad magic number)
[root@prod-2003 /var/db/riak/leveldb]# grep -i error */LOG | awk -F: '{ print $1}' | uniq -c
16928 1043318063368056673053607043667580944695787782144/LOG
16926 1117534943388766586868637914078954691787690606592/LOG
16925 1191751823409476500683668784490328438879593431040/LOG
16926 1377294023461251285221245960518762806609350492160/LOG
300 1414402463471606242128761395724449680155301904384/LOG
16926 523799903223087276348390950787964715052468011008/LOG
16926 598016783243797190163421821199338462144370835456/LOG
16927 709342103274862060885968126816399082782225072128/LOG
16926 820667423305926931608514432433459703420079308800/LOG
for i in `cat /tmp/slf.foo`; do echo Moving $i; mv $i ./slf.leveldb-troublemakers ; done
Moving 1043318063368056673053607043667580944695787782144 repair+open, screen 0
Moving 1117534943388766586868637914078954691787690606592 repair+open, screen 2
Moving 1191751823409476500683668784490328438879593431040 repair+open, screen 3
Moving 1377294023461251285221245960518762806609350492160 repair+open, screen 4
Moving 1414402463471606242128761395724449680155301904384 repair+open, screen 5
Moving 523799903223087276348390950787964715052468011008 repair+open, screen 6
Moving 598016783243797190163421821199338462144370835456 repair+open, screen 7
Moving 709342103274862060885968126816399082782225072128 repair+open, screen 8
Moving 820667423305926931608514432433459703420079308800 repair, screen 9
... of the remaining ones ...
[root@prod-2003 /var/db/riak/leveldb]# du -hs * | grep G
39G 1006209623357701716146091608461894071149836369920 seems ok
53G 1080426503378411629961122478873267818241739194368 seems ok
41G 1154643383399121543776153349284641565333642018816 seems ok
53G 1228860263419831457591184219696015312425544843264 seems ok
41G 1265968703430186414498699654901702185971496255488 seems ok
41G 1303077143440541371406215090107389059517447667712 seems ok
41G 1340185583450896328313730525313075933063399079936 seems ok
41G 1451510903481961199036276830930136553701253316608 seems ok
42G 560908343233442233255906385993651588598419423232 seems ok
41G 635125223254152147070937256405025335690322247680 seems ok
41G 672233663264507103978452691610712209236273659904 seems ok
43G 746450543285217017793483562022085956328176484352 seems ok
41G 783558983295571974700998997227772829874127896576 seems ok
45G 857775863316281888516029867639146576966030721024 seems ok
40G 894884303326636845423545302844833450511982133248 seems ok
46G 931992743336991802331060738050520324057933545472 seems ok
40G 969101183347346759238576173256207197603884957696 seems ok
All seem ok, so let's see what happens when we try the "Moved"
directories.... Damn, every single one is complaining about "Invalid
argument: not an sstable (bad magic number)" or about missing .sst
files. So, I'm repairing every single !@#$! one of them and letting the
!@#$! thing chug through 415GB of data for however long it takes.
NOTE: *Very* important to set the ulimit first!
ulimit -n 65000
erl -pz /opt/local/riak/lib/eleveldb-1.0.1/ebin
f().
{ok, DB0} = eleveldb:open("./", [write]).
B = <<1:(1024*1024*8)>>.
time().
[eleveldb:put(DB0, <<X:32>>, B, []) || X <- lists:seq(1,2000)].
[eleveldb:delete(DB0, <<X:32>>, []) || X <- lists:seq(1,2000)].
B2 = <<1:(8*1024)>>.
[eleveldb:put(DB0, <<X:32>>, B2, []) || X <- lists:seq(1,1*1024*1000)].
[eleveldb:delete(DB0, <<X:32>>, []) || X <- lists:seq(1,1*1024*1000)].
time().
-----
2012-02-13 Monday
[2/13/12 2:45:17 PM] Scott Fritchie: Where would I find the list of 12
nodes currently leaving?
[2/13/12 2:45:36 PM] Justin Shoffstall:
media-riak@prod-2002-internal.XXXer.com
media-riak@prod-2005-internal.XXXer.com
media-riak@prod-2008-internal.XXXer.com
media-riak@prod-2009-internal.XXXer.com
media-riak@prod-2010-internal.XXXer.com
media-riak@prod-2011-internal.XXXer.com
media-riak@prod-2012-internal.XXXer.com
media-riak@prod-2014-internal.XXXer.com
media-riak@prod-2015-internal.XXXer.com
media-riak@prod-2020-internal.XXXer.com
media-riak@prod-2031-internal.XXXer.com
media-riak@prod-2035-internal.XXXer.com
---
2012-02-18 Saturday morning, 12:30 AM or so CST
Owner: media-riak@prod-2027-internal.XXXer.com
Next Owner: media-riak@prod-2106-internal.XXXer.com
Index: 174124218510127105489110888272838406638695088128
Waiting on: [riak_kv_vnode]
Complete: [riak_pipe_vnode]
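The same status can be pulled straight from the ring (a sketch, reusing the
pending_changes/1 call from earlier in these notes):
{ok, Ring} = riak_core_ring_manager:get_my_ring().
Idx = 174124218510127105489110888272838406638695088128.
[Chg || {I, _, _, _, _} = Chg <- riak_core_ring:pending_changes(Ring), I == Idx].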
So, let's study these things:
* I/O activity on 2106 (the destination)
* I/O activity on 2027 (the source)
Right now on 2106, there appear to be two transfers happening:
[root@prod-2106 ~]# env TZ=CST6CDT date ; netstat -na | grep 8099 | grep EST
February 18, 2012 12:32:56 AM CST
192.168.26.11.8099 192.168.24.208.43407 128872 0 70265 0 ESTABLISHED
192.168.26.11.8099 192.168.24.124.47353 128872 0 128872 0 ESTABLISHED
So, one of them is coming from prod-2027 (192.168.24.208), hooray
for the sanity check. The other is coming from
prod-2022 (192.168.24.124). So, for grins and giggles, let's
monitor the vnode handoff folding iterator on prod-2022 also.
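An in-VM cross-check of that netstat output (a sketch; standard inet/erlang
calls only, with 8099 being the handoff port seen above):
HandoffPort = 8099.
TcpPorts = [P || P <- erlang:ports(), erlang:port_info(P, name) == {name, "tcp_inet"}].
rp([Peer || P <- TcpPorts,
            {ok, {_, LPort}} <- [inet:sockname(P)],
            LPort == HandoffPort,
            {ok, Peer} <- [inet:peername(P)]]).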
Output from all 3 boxes is in:
~/b/notes/2011-12-07.XXXer.data/2012-02-18.transfers-data
Only one file has a "date" timestamp in it, alas, but all three
sets of output were started within a few seconds of each
other.
----
Screen experiment on prod-2106: making copies of eleveldb dirs
for iteration experiments....
rsync -rav --delete leveldb/944837972571345441260585311775565780285378265088 991937146430642117335508748767399119786008903680 999073384894171916740800178614646595467922636800 slf-basho-leveldb-exp-data/
NOTE: The rsync is now finished, see ~root/slf-basho-leveldb-exp-data/screen.1.out
for the output of the rsync commands.