Round | Test Window | Load (Finance Core System) | OAP HW Specs | OAP Core gRPC Threads | OAP Core Prepare Threads | ES Concurrent Requests | OAP CPU Utilization | OAP RAM Utilization | OAP Thread Pool Utilization | OAP L1 Aggregation /Min | OAP L2 Aggregation /Min | OAP Trace Analysis /Min | OAP Persistence /5Min | Persistence Preparing (99%) | Persistence Execution (99%) | ES CPU Utilization | ES Write QPS (99%) | ES Disk Utilization | ES RAM Utilization | ES (Elasticsearch) Specs | OAP & ES Performance Summary & Tuning Steps |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0429 17:30 - 18:30 | 5 CCU, 1 Hour | 4GHz CPU, 4GB RAM | 16 | 16 | 2 | 1GHz | 2GB | 220 | 0.5M | 0.2M | 11k | 11k | 99% 10ms | 99% 10s | 70% CPU | 10k | 60% - 80% | 70% | 3 nodes x 2C/4GB RAM/20GB Disk, 1 replica, 500 IOPS | Passed |
2 | 0430 11:00 - 15:00 | 5 CCU, 10 Hours | 4GHz CPU, 4GB RAM | 16 | 16 | 2 | 1.8GHz | 3.2GB | 240 | 0.75M | 0.3M | 17k | 11k | 99% 100ms | 99% 10s | 80% CPU | 10k | 65% - 100% | 70% | 3 nodes x 2C/4GB RAM/20GB Disk, 1 replica, 500 IOPS | No input data after 15:00 (4 hours in); no apparent impact on the business system. Root cause: 1GB consumed per CCU/hour, so the 20GB disk filled within 4 hours. Action: scale out ES disk from 20GB to 100GB per node, and scale out OAP CPU from 4GHz to 6GHz. |
3 | 0430 22:00 - 24:00 | 10 CCU, 2 Hours | 6GHz CPU, 8GB RAM | 16 | 16 | 4 | 4.8GHz | 4GB | 350 (eventloop5: 1, eventloop4: 200, DataCarrier: 6, Grpc: 16, Prepare: 2, Pool2-Prom: 11) | 1.5M | 0.5M | 35k | 11k (85% Prep, 15% Exec) | 99% 100ms | 99% 10s | 90% CPU | 12.5k | 25% - 40% | 70% | 3 nodes x 2C/4GB RAM/100GB Disk, 1 replica, 2500 IOPS | Passed |
4 | 0502 10:00 - 14:00 | 10 CCU, 4 Hours | 6GHz CPU, 8GB RAM | 16 | 16 | 4 | 4.8GHz | 4GB | 152 - 340 (eventloop5: 2, eventloop4: 15, DataCarrier: 26, Grpc: 16, Prepare: 2, Pool2-Prom: 11) | 1M | 0.5M | 40k | 11k (85% Prep, 15% Exec) | 99% 100ms | 99% 10s | 70% CPU | 11.5k | 36% - 72% | 70% | 3 nodes x 2C/4GB RAM/100GB Disk, 1 replica, 2500 IOPS | Warning: gRPC server thread pool full after 2 hours: org.apache.skywalking.oap.server.library.server.grpc.GRPCServer 115 [grpc-default-worker-ELG-7-4] WARN [] - Grpc server thread pool is full, rejecting the task. Action 1: to avoid filling the disk, activate housekeeping with a retention policy of 2 days for records data and 21 days for metrics data. Action 2: increase gRPC threads from 16 to 32. |
5 | 0502 21:15 - 24:00 | 20 CCU, 2 Hours | 6GHz CPU, 8GB RAM | 32 | 32 | 4 | 4.8GHz | 2.5GB | 350 | 1.5M | 0.5M | 40k | 11k | 99% 100ms | 99% 10s | 60% | 8.8k | 47% - 70% | 70% | 3 nodes x 2C/4GB RAM/100GB Disk, 1 replica, 2500 IOPS | Warning: "es_rejected_execution_exception", rejected execution of coordinating operation. Caused by: java.lang.RuntimeException: {"error":{"root_cause":[{"type":"es_rejected_execution_exception","reason":"rejected execution of coordinating operation [coordinating_and_primary_bytes=142150483, replica_bytes=5063426, all_bytes=147213909, coordinating_operation_bytes=8817157, max_coordinating_and_primary_bytes=148655308]"}],"status":429}. Action: increase ES nodes to 4C/4GB RAM. |
5 (retest) | 0503 10:00 - 12:00 | Retest after the thread-pool-full warning in the previous run; OAP restarted at 10:00 | 6GHz CPU, 8GB RAM | 32 | 32 | 4 | 7.2GHz | 3.5GB | 200 | 2.6M | 0.6M | 75k | 4k | 99% 250ms | 100% 10s | 7% | 0 | 62% | 70% | 3 nodes x 2C/4GB RAM/100GB Disk, 1 replica, 2500 IOPS | Still hit the same warning as above. Action: increase OAP from 1 node to 2 nodes x 5GHz/5GB RAM, 20 gRPC threads per node. |
6 | 0503 22:00 - 26:00 | 20 CCU, 4 Hours | 5GHz CPU, 5GB RAM x 2 pods | 20 x 2 pods | 20 x 2 pods | 4 x 2 | 4.8GHz x 2 | 2.5GB x 2 | 120 (x 2) (eventloop5: 5, eventloop4: 4, DataCarrier: 22, Grpc: 20, Prepare: 2, Pool2-Prom: 11) | 1.5M x 2 | 0.3M x 2 | 40k x 2 | 9k x 2 | 99% 100ms | 99% 10s | 75% CPU | 8.8k | N/A | 70% | 3 nodes x 4C/4GB RAM/100GB Disk, 1 replica, 2500 IOPS | Warning: "es_rejected_execution_exception", rejected execution of coordinating operation. Caused by: java.lang.RuntimeException: {"error":{"root_cause":[{"type":"es_rejected_execution_exception","reason":"rejected execution of coordinating operation [coordinating_and_primary_bytes=142150483, replica_bytes=5063426, all_bytes=147213909, coordinating_operation_bytes=8817157, max_coordinating_and_primary_bytes=148655308]"}],"status":429} |
7 | 0504 09:00 - 13:00 | 20 CCU, 4 Hours | 5GHz CPU, 5GB RAM x 2 pods | 20 x 2 pods | 9 x 2 pods | 4 x 2 | N/A | N/A | 140 (x 2) (eventloop5: 5, eventloop4: 12, DataCarrier: 22, Grpc: 40, Prepare: 2, Pool2-Prom: 11) | N/A | N/A | N/A | N/A | N/A | N/A | 50% | 15k | N/A | 70% | 3 nodes x 4C/8GB RAM/100GB Disk, 1 replica, 2500 IOPS | Warning: thread pool full. Action 1: increase gRPC thread pool size to 40 and pool queue to 20000. Warning again: thread pool full. Action 2: increase OAP CPU from 5GHz to 8GHz, and ES disk from 100GB to 150GB. |
8 | 0504 15:00 - 17:00 | 20 CCU, 4 Hours | 8GHz CPU, 5GB RAM x 2 pods | 40 (20k queue) x 2 pods | 9 x 2 pods | 4 x 2 | N/A | N/A | 170 (x 2) (eventloop5: 8, eventloop4: 7, DataCarrier: 35, Grpc: 40, Prepare: 2, Pool2-Prom: 17) | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 3 nodes x 4C/8GB RAM/150GB Disk, 1 replica, 3750 IOPS | Warning again: thread pool full. Action: increase OAP gRPC threads/prepare threads and ES concurrent requests to 40. |
9 | 0504 19:40 - 23:40 | 20 CCU, 4 Hours | 8GHz CPU, 8GB RAM x 2 pods | 40 (20k queue) x 2 pods | 40 x 2 pods | 4 x 2 | 7GHz x 2 | 5.5GB x 2 | 230 (x 2) (eventloop5: 8, eventloop4: 18, DataCarrier: 35, Grpc: 40, Prepare: 40, Pool2-Prom: 17) | 1.6M x 2 | 0.4M x 2 | 40k x 2 | 10k x 2 | 99% 500ms | 99% 5s | 75% | 15k | 39% - ? | 70% | 3 nodes x 4C/8GB RAM/150GB Disk, 1 replica, 3750 IOPS | Warning: ES bulk post timeout, over 15 seconds. Action 1: increase ES flush interval from 10s to 30s and bulk actions from 5000 to 15000. Warning again: ES bulk post timeout, over 15 seconds. Action 2: reduce flush interval to 8s and bulk actions to 3000. Action 3: reduce ES concurrent requests to 4. |
10 | 0505 19:40 - 23:40 | 20 CCU, 4 Hours | 8GHz CPU, 8GB RAM x 2 pods | 40 (20k queue) x 2 pods | 40 x 2 pods | 4 x 2 | 90% | 6GB x 2 | N/A | 1.6M x 2 | 0.4M x 2 | 35k x 2 | 11k x 2 | 99% 700ms | 99% 10s | 50% | 12.5k | 12% | 55% | 3 nodes x 4C/8GB RAM/150GB Disk, 1 replica, 3750 IOPS | Action: housekept 250GB (2 days of log indices) via Kibana; changed the ES node type to data-analysis mode. Warning: thread pool full on 1 node. Action: increase ES from 3 nodes to 6 nodes and 6 shards, to raise the max IOPS capacity. |
11 | 0505 21:15 - 23:40 | 20 CCU, 4 Hours | 8GHz CPU, 8GB RAM x 2 pods | 40 (20k queue) x 2 pods | 40 x 2 pods | 4 x 2 | 85% | 6.5GB x 2 | N/A | 1.4M x 2 | 0.3M x 2 | 33k x 2 | 11k x 2 | 99% 500ms | 99% 1s - 10s (fluctuating) | 40% | 6 - 14.5k | 7% - 16% | 55% | 6 nodes x 4C/8GB RAM/150GB Disk, 1 replica, 7500 IOPS | Warning: thread pool full on 1 node. Action 1: increase gRPC thread pool size to 60 and queue to 40000. Warning again: thread pool full on 1 node. Action 2: extend bulk post from every 8s to every 60s, and from 3000 to 20000 actions. |
12 | 0506 07:45 - 11:45 | 20 CCU, 4 Hours | 8GHz CPU, 8GB RAM x 2 pods | 60 (30k queue) x 2 pods | 60 x 2 pods | 4 x 2 | 90% | 6GB x 2 | N/A | 1.8M x 2 | 0.5M x 2 | 40k x 2 | 11k x 2 (85% Prep, 15% Exec) | 99% 250ms | 99% 10s | 40% | 11k | 15% - 26% | 55% | 6 nodes x 4C/8GB RAM/150GB Disk, 1 replica, 7500 IOPS | Passed |
13 | 0506 23:00 - 24:00 | 40 CCU, 1 Hour | 8GHz CPU, 8GB RAM x 2 pods | 60 (30k queue) x 2 pods | 60 x 2 pods | 4 x 2 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 6 nodes x 4C/8GB RAM/150GB Disk, 1 replica, 7500 IOPS | Passed, but with WARN: com.linecorp.armeria.common.ContentTooLargeException: maxContentLength: 10485760, contentLength: 10572394, transferred: 10491988. Action: content length tuning pending; only a small part of the data is dropped, and everything else keeps working. |
14 | 0507 09:20 - 11:20 | 50 CCU, 2 Hours | 8GHz CPU, 8GB RAM x 2 pods | 60 (30k queue) x 2 pods | 60 x 2 pods | 4 x 2 | 80% | 5.5GB x 2 | 300 (x 2) (eventloop5: 6, eventloop4: 12, DataCarrier: 35, Grpc1: 60, Prepare: 60, Prom: 17) | 1.5M x 2 | 0.3M x 2 | 40k x 2 | 9k x 2 (85% Prep, 15% Exec) | 99% 250ms | 99% 10s | 48% | 25k | 29% - 41% | 49% | 6 nodes x 4C/8GB RAM/150GB Disk, 1 replica, 7500 IOPS | Warning: thread pool full on 1 node, 30 minutes in. Action: extend bulk post from every 60s to every 90s, and from 20000 to 30000 actions. |
15 | 0507 13:45 - 15:45 | 50 CCU, 2 Hours | 8GHz CPU, 8GB RAM x 2 pods | 60 (30k queue) x 2 pods | 60 x 2 pods | 4 x 2 | 100% | 3.3GB x 2 | 300 (x 2) (eventloop5: 6, eventloop4: 12, DataCarrier: 35, Grpc1: 60, Prepare: 60, Pool2-Prom: 17) | 2M x 2 | 0.4M x 2 | 52k x 2 | 8k x 2 (85% Prep, 15% Exec) | 99% 500ms | 99% 10s | 48% | 25k | 29% - 41% | 49% | 6 nodes x 4C/8GB RAM/150GB Disk, 1 replica, 7500 IOPS | Warning 1: thread pool full on 1 node, 60 minutes in. Warning 2: OAP CPU utilization reached 100% and triggered a pod restart via the K8s health check. Warning 3: consistently higher CPU utilization on one particular ES node. Action 1: extend bulk post from every 90s to every 180s, and from 30000 to 60000 actions. Action 2: increase OAP pods from 2 to 3. Action 3: checked with Kibana and found the index shards still at default values (5 shards for LOG/SEGMENT, 1 shard for metrics); the root cause was the misspelled env parameter SW_STORAGE_ES_INDEX_SHARDS_NUMBE on the OAP pod, corrected to SW_STORAGE_ES_INDEX_SHARDS_NUMBER (6 shards). |
16 | 0507 16:50 - 18:50 | 50 CCU, 2 Hours | 8GHz CPU, 8GB RAM x 3 pods | 60 (30k queue) x 3 pods | 60 x 3 pods | 4 x 3 | 42% | 3.5GB x 3 | 270 (x 3) (eventloop5: 8, eventloop4: 27, DataCarrier: 35, Grpc1: 60, Prepare: 60, Pool2-Prom: 17) | 0.6M x 3 | 0.2M x 3 | 18k x 3 | 9k x 3 (85% Prep, 15% Exec) | 99% 250ms | 99% 10s | 50% | 20k | 62% - ? | 52% | 6 nodes x 4C/8GB RAM/150GB Disk, 1 replica, 7500 IOPS | Passed. No exact end-time disk size because the housekeeping job ran during the test window. With WARN: com.linecorp.armeria.common.ContentTooLargeException: maxContentLength: 10485760, contentLength: 10572394, transferred: 10491988. Action: content length tuning pending; only a small part of the data is dropped, and everything else keeps working. |
17 | 0507 19:30 - 20:30 | 100 CCU, 1 Hour | 8GHz CPU, 8GB RAM x 3 pods | 60 (30k queue) x 3 pods | 60 x 3 pods | 4 x 3 | 43% | 3.8GB x 3 | 270 (x 3) (eventloop5: 8, eventloop4: 27, DataCarrier: 35, Grpc1: 60, Prepare: 60, Pool2-Prom: 17) | 0.6M x 3 | 0.2M x 3 | 17k x 3 | 8.8k x 3 (85% Prep, 15% Exec) | 99% 250ms | 99% 10s | 44% | 18k | 59% | N/A | 6 nodes x 4C/8GB RAM/150GB Disk, 1 replica, 7500 IOPS | Throughput appears capped by a bottleneck on the application system side. Action: deploy the javaagent onto more microservices and identify the bottleneck in the application system. |
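For reference, the parameter set the table converges on by rounds 16-17 can be expressed as OAP environment variables. This is a minimal sketch assuming SkyWalking 8.x env var names with the elasticsearch7 storage backend; the values come straight from the actions above, and the queue size follows the "30k queue" recorded in rounds 12+ rather than the 40000 briefly mentioned in round 11.

```sh
# Final OAP tuning values reached by rounds 16-17 (assuming SkyWalking 8.x env names).
# gRPC server thread pool (rounds 4/7/11: 16 -> 32 -> 40 -> 60; queue 20k -> 30k).
export SW_CORE_GRPC_THREAD_POOL_SIZE=60
export SW_CORE_GRPC_THREAD_POOL_QUEUE_SIZE=30000
# Housekeeping/TTL from round 4: 2 days for records, 21 days for metrics (unit: days).
export SW_CORE_RECORD_DATA_TTL=2
export SW_CORE_METRICS_DATA_TTL=21
# ES bulk processor (rounds 9/11/14/15: settled at 60000 actions flushed every 180s).
export SW_STORAGE_ES_BULK_ACTIONS=60000
export SW_STORAGE_ES_FLUSH_INTERVAL=180
export SW_STORAGE_ES_CONCURRENT_REQUESTS=4
# Index layout (round 15: a misspelled variable name silently left the default shards).
export SW_STORAGE_ES_INDEX_SHARDS_NUMBER=6
export SW_STORAGE_ES_INDEX_REPLICAS_NUMBER=1
```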
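Round 15's root cause (the misspelled SW_STORAGE_ES_INDEX_SHARDS_NUMBE) is a good argument for verifying the effective shard layout directly in Elasticsearch rather than trusting the OAP config. A quick check against the _cat API might look like the following, where es-host is a placeholder:

```sh
# List indices with their primary/replica shard counts and size.
# A "pri" value stuck at the ES default means the OAP shard setting never took effect.
curl -s 'http://es-host:9200/_cat/indices?v&h=index,pri,rep,store.size'
```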
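The es_rejected_execution_exception (HTTP 429) in rounds 5-6 is Elasticsearch indexing pressure: in ES 7.9+, max_coordinating_and_primary_bytes defaults to 10% of the JVM heap (the indexing_pressure.memory.limit setting), so the cited ~148MB limit suggests a heap of roughly 1.4GB on the 4GB-RAM nodes, and raising node RAM (and heap) raises that ceiling. Write-path rejections can be watched during a run with the _cat API (es-host again a placeholder):

```sh
# Per-node write thread pool stats; a growing "rejected" counter indicates
# ES cannot keep up with the OAP bulk writes.
curl -s 'http://es-host:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected'
```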