skywalking-perf-review
@lewiselau, created May 7, 2022 16:18
Each round below records: the test window and Finance Core System load (CCU and duration); the OAP hardware spec and thread settings (gRPC threads, prepare threads, ES concurrent requests); OAP utilization (CPU, RAM, thread pool); OAP throughput (L1 and L2 aggregation per minute, trace analysis per minute, persistence per 5 minutes, with persistence preparing and execution latency); the Elasticsearch cluster spec and utilization (CPU, p99 write QPS, disk, RAM); and a performance summary with tuning steps.
Round 1 (0429 17:30-18:30): 5 CCU, 1 hour
- OAP spec: 4GHz CPU, 4GB RAM; 16 gRPC threads, 16 prepare threads, 2 ES concurrent requests
- OAP utilization: CPU 1GHz, RAM 2GB, thread pool 220
- OAP throughput: L1 aggregation 0.5M/min, L2 aggregation 0.2M/min, trace analysis 11k/min, persistence 11k/5min (preparing p99 10ms, execution p99 10s)
- ES spec: 3 nodes x 2C/4GB RAM/20GB disk, 1 replica, 500 IOPS
- ES utilization: CPU 70%, write QPS 10k (p99), disk 60-80%, RAM 70%
- Summary: Passed.
Round 2 (0430 11:00-15:00): 5 CCU, 10 hours planned
- OAP spec: 4GHz CPU, 4GB RAM; 16 gRPC threads, 16 prepare threads, 2 ES concurrent requests
- OAP utilization: CPU 1.8GHz, RAM 3.2GB, thread pool 240
- OAP throughput: L1 aggregation 0.75M/min, L2 aggregation 0.3M/min, trace analysis 17k/min, persistence 11k/5min (preparing p99 100ms, execution p99 10s)
- ES spec: 3 nodes x 2C/4GB RAM/20GB disk, 1 replica, 500 IOPS
- ES utilization: CPU 80%, write QPS 10k (p99), disk 65-100%, RAM 70%
- Summary: No input data arrived after 15:00 (4 hours in); this appeared to have no impact on the business system. Root cause: telemetry consumes roughly 1GB per CCU per hour, so 5 CCU filled the 20GB disk within 4 hours (5 x 1GB x 4h = 20GB). Action: scale out the ES disk from 20GB to 100GB per node, and scale out OAP CPU from 6GHz to 8GHz.
Round 3 (0430 22:00-24:00): 10 CCU, 2 hours
- OAP spec: 6GHz CPU, 8GB RAM; 16 gRPC threads, 16 prepare threads, 4 ES concurrent requests
- OAP utilization: CPU 4.8GHz, RAM 4GB, thread pool 350 (eventloop5: 1, eventloop4: 200, DataCarrier: 6, Grpc: 16, Prepare: 2, Pool2-Prom: 11)
- OAP throughput: L1 aggregation 1.5M/min, L2 aggregation 0.5M/min, trace analysis 35k/min, persistence 11k/5min (85% preparing, 15% execution; preparing p99 100ms, execution p99 10s)
- ES spec: 3 nodes x 2C/4GB RAM/100GB disk, 1 replica, 2500 IOPS
- ES utilization: CPU 90%, write QPS 12.5k (p99), disk 25-40%, RAM 70%
- Summary: Passed.
Round 4 (0502 10:00-14:00): 10 CCU, 4 hours
- OAP spec: 6GHz CPU, 8GB RAM; 16 gRPC threads, 16 prepare threads, 4 ES concurrent requests
- OAP utilization: CPU 4.8GHz, RAM 4GB, thread pool 152-340 (eventloop5: 2, eventloop4: 15, DataCarrier: 26, Grpc: 16, Prepare: 2, Pool2-Prom: 11)
- OAP throughput: L1 aggregation 1M/min, L2 aggregation 0.5M/min, trace analysis 40k/min, persistence 11k/5min (85% preparing, 15% execution; preparing p99 100ms, execution p99 10s)
- ES spec: 3 nodes x 2C/4GB RAM/100GB disk, 1 replica, 2500 IOPS
- ES utilization: CPU 70%, write QPS 11.5k (p99), disk 36-72%, RAM 70%
- Summary: Warning two hours in: the gRPC server thread pool filled up: org.apache.skywalking.oap.server.library.server.grpc.GRPCServer 115 [grpc-default-worker-ELG-7-4] WARN [] - Grpc server thread pool is full, rejecting the task. Action 1: to keep the disk from filling up, activate housekeeping with a retention policy of 2 days for record data and 21 days for metrics data (see the config sketch below). Action 2: increase gRPC threads from 16 to 32.
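The retention policy from Action 1 maps to the OAP core TTL settings. A minimal application.yml sketch, assuming SkyWalking 8.x key names (TTL values are in days; verify the exact keys, or their env-var overrides, against the application.yml bundled with your OAP version):

```yaml
core:
  default:
    # Round 4, Action 1: housekeeping retention policy (values in days)
    recordDataTTL: 2     # trace/log record data kept for 2 days
    metricsDataTTL: 21   # metrics data kept for 21 days
```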
Round 5 (0502 21:15-24:00): 20 CCU, 2 hours
- OAP spec: 6GHz CPU, 8GB RAM; 32 gRPC threads, 32 prepare threads, 4 ES concurrent requests
- OAP utilization: CPU 4.8GHz, RAM 2.5GB, thread pool 350
- OAP throughput: L1 aggregation 1.5M/min, L2 aggregation 0.5M/min, trace analysis 40k/min, persistence 11k/5min (preparing p99 100ms, execution p99 10s)
- ES spec: 3 nodes x 2C/4GB RAM/100GB disk, 1 replica, 2500 IOPS
- ES utilization: CPU 60%, write QPS 8.8k (p99), disk 47-70%, RAM 70%
- Summary: Warning: es_rejected_execution_exception, "rejected execution of coordinating operation". Caused by: java.lang.RuntimeException: {"error":{"root_cause":[{"type":"es_rejected_execution_exception","reason":"rejected execution of coordinating operation [coordinating_and_primary_bytes=142150483, replica_bytes=5063426, all_bytes=147213909, coordinating_operation_bytes=8817157, max_coordinating_and_primary_bytes=148655308]"}],"status":429}. Action: increase ES nodes to 4C/4GB RAM (see the note on the indexing-pressure limit below).
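For context on this rejection: the max_coordinating_and_primary_bytes figure comes from Elasticsearch's indexing-pressure limit (ES 7.9+), which defaults to 10% of the JVM heap, so giving the ES nodes more RAM/heap, as done here, raises that ceiling. The limit can also be set explicitly; a sketch only, and raising it merely hides back-pressure if the cluster is genuinely saturated:

```yaml
# elasticsearch.yml (sketch, ES 7.9+): per-node memory allowed for in-flight indexing requests
indexing_pressure.memory.limit: 20%   # default is 10% of heap
```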
Follow-up run (0503 10:00-12:00): the thread pool had filled during the previous test; OAP was restarted at 10:00
- OAP spec: 6GHz CPU, 8GB RAM; 32 gRPC threads, 32 prepare threads, 4 ES concurrent requests
- OAP utilization: CPU 7.2GHz, RAM 3.5GB, thread pool 200
- OAP throughput: L1 aggregation 2.6M/min, L2 aggregation 0.6M/min, trace analysis 75k/min, persistence 4k/5min (preparing p99 250ms, execution p100 10s)
- ES spec: 3 nodes x 2C/4GB RAM/100GB disk, 1 replica, 2500 IOPS
- ES utilization: CPU 7%, write QPS 0, disk 62%, RAM 70%
- Summary: Still hit the same warning as above. Action: increase the OAP spec from 1 node to 2 nodes x 5GHz/5GB RAM, with 20 gRPC threads per node.
Round 6 (0503 22:00-26:00): 20 CCU, 4 hours
- OAP spec: 5GHz CPU, 5GB RAM x 2 pods; 20 gRPC threads, 20 prepare threads, 4 ES concurrent requests per pod
- OAP utilization (per pod): CPU 4.8GHz, RAM 2.5GB, thread pool 120 (eventloop5: 5, eventloop4: 4, DataCarrier: 22, Grpc: 20, Prepare: 2, Pool2-Prom: 11)
- OAP throughput (per pod): L1 aggregation 1.5M/min, L2 aggregation 0.3M/min, trace analysis 40k/min, persistence 9k/5min (preparing p99 100ms, execution p99 10s)
- ES spec: 3 nodes x 4C/4GB RAM/100GB disk, 1 replica, 2500 IOPS
- ES utilization: CPU 75%, write QPS 8.8k (p99), RAM 70%
- Summary: Warning: es_rejected_execution_exception, "rejected execution of coordinating operation", the same coordinating-operation rejection (status 429) as in Round 5.
Round 7 (0504 09:00-13:00): 20 CCU, 4 hours
- OAP spec: 5GHz CPU, 5GB RAM x 2 pods; 20 gRPC threads, 9 prepare threads, 4 ES concurrent requests per pod
- OAP utilization (per pod): CPU N/A, RAM N/A, thread pool 140 (eventloop5: 5, eventloop4: 12, DataCarrier: 22, Grpc: 40, Prepare: 2, Pool2-Prom: 11)
- OAP throughput: not recorded
- ES spec: 3 nodes x 4C/8GB RAM/100GB disk, 1 replica, 2500 IOPS
- ES utilization: CPU 50%, write QPS 15k (p99), disk N/A, RAM 70%
- Summary: Warning: thread pool full. Action 1: increase gRPC threads to 40 and the pool queue to 20000 (see the config sketch below). Warning again: thread pool full. Action 2: increase OAP CPU from 5GHz to 8GHz and the ES disk from 100GB to 150GB.
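Round 7's Action 1 corresponds to the OAP core gRPC server pool settings. A sketch assuming SkyWalking 8.x key names (some releases expose these only as env overrides such as SW_CORE_GRPC_THREAD_POOL_SIZE; check your application.yml):

```yaml
core:
  default:
    # Round 7, Action 1: widen the gRPC receiver pool and its queue
    gRPCThreadPoolSize: 40          # was 20 per pod
    gRPCThreadPoolQueueSize: 20000  # pool queue depth
```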
Round 8 (0504 15:00-17:00): 20 CCU, 4 hours
- OAP spec: 8GHz CPU, 5GB RAM x 2 pods; 40 gRPC threads (20k queue), 9 prepare threads, 4 ES concurrent requests per pod
- OAP utilization (per pod): thread pool 170 (eventloop5: 8, eventloop4: 7, DataCarrier: 35, Grpc: 40, Prepare: 2, Pool2-Prom: 17); other metrics and the ES spec were not recorded
- Summary: Warning again: thread pool full. Action: increase the OAP gRPC threads / prepare threads and ES concurrent requests to 40.
Round 9 (0504 19:40-23:40): 20 CCU, 4 hours
- OAP spec: 8GHz CPU, 8GB RAM x 2 pods; 40 gRPC threads (20k queue), 40 prepare threads, 4 ES concurrent requests per pod
- OAP utilization (per pod): CPU 7GHz, RAM 5.5GB, thread pool 230 (eventloop5: 8, eventloop4: 18, DataCarrier: 35, Grpc: 40, Prepare: 40, Pool2-Prom: 17)
- OAP throughput (per pod): L1 aggregation 1.6M/min, L2 aggregation 0.4M/min, trace analysis 40k/min, persistence 10k/5min (preparing p99 500ms, execution p99 5s)
- ES spec: 3 nodes x 4C/8GB RAM/150GB disk, 1 replica, 3750 IOPS
- ES utilization: CPU 75%, write QPS 15k (p99), disk 39%-?, RAM 70%
- Summary: Warning: ES bulk post timeout, over 15 seconds. Action 1: increase the ES flush interval from 10s to 30s and bulk actions from 5000 to 15000. The timeout warning persisted. Action 2: reduce the flush interval to 8s and bulk actions to 3000. Action 3: reduce ES concurrent requests to 4. (See the config sketch below.)
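The flush-interval / bulk-actions / concurrent-requests knobs tuned in Round 9 live in the OAP Elasticsearch storage provider config. A sketch with the values the round ended on, assuming SkyWalking 8.x key names under the active elasticsearch storage section (flushInterval is in seconds):

```yaml
storage:
  elasticsearch:
    # Round 9, Actions 2-3: smaller, more frequent bulk posts
    bulkActions: 3000       # records per bulk request (was raised to 15000, originally 5000)
    flushInterval: 8        # flush the bulk buffer every 8 seconds
    concurrentRequests: 4   # bulk requests allowed in flight
```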
Round 10 (0505 19:40-23:40): 20 CCU, 4 hours
- OAP spec: 8GHz CPU, 8GB RAM x 2 pods; 40 gRPC threads (20k queue), 40 prepare threads, 4 ES concurrent requests per pod
- OAP utilization (per pod): CPU 90%, RAM 6GB
- OAP throughput (per pod): L1 aggregation 1.6M/min, L2 aggregation 0.4M/min, trace analysis 35k/min, persistence 11k/5min (preparing p99 700ms, execution p99 10s)
- ES spec: 3 nodes x 4C/8GB RAM/150GB disk, 1 replica, 3750 IOPS
- ES utilization: CPU 50%, write QPS 12.5k (p99), disk 12%, RAM 55%
- Summary: Action: housekeep 250GB (2 days of log indices) via Kibana, and change the ES type to data-analysis mode. Warning: thread pool full on one node. Action: increase ES from 3 nodes to 6 nodes and 6 shards to raise the maximum IOPS capacity.
Round 11 (0505 21:15-23:40): 20 CCU, 4 hours
- OAP spec: 8GHz CPU, 8GB RAM x 2 pods; 40 gRPC threads (20k queue), 40 prepare threads, 4 ES concurrent requests per pod
- OAP utilization (per pod): CPU 85%, RAM 6.5GB
- OAP throughput (per pod): L1 aggregation 1.4M/min, L2 aggregation 0.3M/min, trace analysis 33k/min, persistence 11k/5min (preparing p99 500ms, execution p99 1s - 5s - 10s)
- ES spec: 6 nodes x 4C/8GB RAM/150GB disk, 1 replica, 7500 IOPS
- ES utilization: CPU 40%, write QPS 6-14.5k (p99), disk 7-16%, RAM 55%
- Summary: Warning: thread pool full on one node. Action 1: increase the gRPC thread pool size to 60 and the queue to 40000. Warning again: thread pool full on one node. Action 2: extend the bulk post interval from 8s to every 60s and raise bulk actions from 3000 to 20000.
Round 12 (0506 07:45-11:45): 20 CCU, 4 hours
- OAP spec: 8GHz CPU, 8GB RAM x 2 pods; 60 gRPC threads (30k queue), 60 prepare threads, 4 ES concurrent requests per pod
- OAP utilization (per pod): CPU 90%, RAM 6GB
- OAP throughput (per pod): L1 aggregation 1.8M/min, L2 aggregation 0.5M/min, trace analysis 40k/min, persistence 11k/5min (85% preparing, 15% execution; preparing p99 250ms, execution p99 10s)
- ES spec: 6 nodes x 4C/8GB RAM/150GB disk, 1 replica, 7500 IOPS
- ES utilization: CPU 40%, write QPS 11k (p99), disk 15-26%, RAM 55%
- Summary: Passed.
Round 13 (0506 23:00-24:00): 40 CCU, 1 hour
- OAP spec: 8GHz CPU, 8GB RAM x 2 pods; 60 gRPC threads (30k queue), 60 prepare threads, 4 ES concurrent requests per pod
- Utilization and throughput metrics not recorded.
- ES spec: 6 nodes x 4C/8GB RAM/150GB disk, 1 replica, 7500 IOPS
- Summary: Passed, but with a warning: com.linecorp.armeria.common.ContentTooLargeException: maxContentLength: 10485760, contentLength: 10572394, transferred: 10491988. Action: content-length tuning is still pending; only a small portion of the data is dropped, and everything else keeps working.
Round 14 (0507 09:20-11:20): 50 CCU, 2 hours
- OAP spec: 8GHz CPU, 8GB RAM x 2 pods; 60 gRPC threads (30k queue), 60 prepare threads, 4 ES concurrent requests per pod
- OAP utilization (per pod): CPU 80%, RAM 5.5GB, thread pool 300 (eventloop5: 6, eventloop4: 12, DataCarrier: 35, Grpc1: 60, Prepare: 60, Prom: 17)
- OAP throughput (per pod): L1 aggregation 1.5M/min, L2 aggregation 0.3M/min, trace analysis 40k/min, persistence 9k/5min (85% preparing, 15% execution; preparing p99 250ms, execution p99 10s)
- ES spec: 6 nodes x 4C/8GB RAM/150GB disk, 1 replica, 7500 IOPS
- ES utilization: CPU 48%, write QPS 25k (p99), disk 29-41%, RAM 49%
- Summary: Warning: thread pool full on one node 30 minutes in. Action: extend the bulk post interval from every 60s to every 90s and raise bulk actions from 20000 to 30000.
Round 15 (0507 13:45-15:45): 50 CCU, 2 hours
- OAP spec: 8GHz CPU, 8GB RAM x 2 pods; 60 gRPC threads (30k queue), 60 prepare threads, 4 ES concurrent requests per pod
- OAP utilization (per pod): CPU 100%, RAM 3.3GB, thread pool 300 (eventloop5: 6, eventloop4: 12, DataCarrier: 35, Grpc1: 60, Prepare: 60, Pool2-Prom: 17)
- OAP throughput (per pod): L1 aggregation 2M/min, L2 aggregation 0.4M/min, trace analysis 52k/min, persistence 8k/5min (85% preparing, 15% execution; preparing p99 500ms, execution p99 10s)
- ES spec: 6 nodes x 4C/8GB RAM/150GB disk, 1 replica, 7500 IOPS
- ES utilization: CPU 48%, write QPS 25k (p99), disk 29-41%, RAM 49%
- Summary: Warning 1: thread pool full on one node 60 minutes in. Warning 2: OAP CPU utilization reached 100% and triggered a pod restart via the K8s health check. Warning 3: one ES node consistently showed higher CPU utilization than the others. Action 1: extend the bulk post interval from every 90s to every 180s and raise bulk actions from 30000 to 60000. Action 2: increase OAP pods from 2 to 3. Action 3: checking with Kibana showed the index shard counts were still at their defaults (5 shards for LOG/SEGMENT, 1 shard for metrics); the root cause was the misspelled env parameter SW_STORAGE_ES_INDEX_SHARDS_NUMBE on the OAP pod, corrected to SW_STORAGE_ES_INDEX_SHARDS_NUMBER (6 shards; see the sketch below).
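The shard-count fix in Action 3 is purely the env-var spelling on the OAP pod. A minimal Kubernetes container-env sketch (container name and surrounding layout are illustrative; only the variable name and value come from the test notes):

```yaml
# OAP Deployment fragment (illustrative)
containers:
  - name: oap
    env:
      - name: SW_STORAGE_ES_INDEX_SHARDS_NUMBER   # previously misspelled ..._NUMBE, so the default shard count applied
        value: "6"
```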
Round 16 (0507 16:50-18:50): 50 CCU, 2 hours
- OAP spec: 8GHz CPU, 8GB RAM x 3 pods; 60 gRPC threads (30k queue), 60 prepare threads, 4 ES concurrent requests per pod
- OAP utilization (per pod): CPU 42%, RAM 3.5GB, thread pool 270 (eventloop5: 8, eventloop4: 27, DataCarrier: 35, Grpc1: 60, Prepare: 60, Pool2-Prom: 17)
- OAP throughput (per pod): L1 aggregation 0.6M/min, L2 aggregation 0.2M/min, trace analysis 18k/min, persistence 9k/5min (85% preparing, 15% execution; preparing p99 250ms, execution p99 10s)
- ES spec: 6 nodes x 4C/8GB RAM/150GB disk, 1 replica, 7500 IOPS
- ES utilization: CPU 50%, write QPS 20k (p99), disk 62%-?, RAM 52%
- Summary: Passed. No exact end-of-test disk size because the housekeeping job ran during the test window. Warning: com.linecorp.armeria.common.ContentTooLargeException: maxContentLength: 10485760, contentLength: 10572394, transferred: 10491988. Action: content-length tuning is still pending; only a small portion of the data is dropped, and everything else keeps working.
Round 17 (0507 19:30-20:30): 100 CCU, 1 hour
- OAP spec: 8GHz CPU, 8GB RAM x 3 pods; 60 gRPC threads (30k queue), 60 prepare threads, 4 ES concurrent requests per pod
- OAP utilization (per pod): CPU 43%, RAM 3.8GB, thread pool 270 (eventloop5: 8, eventloop4: 27, DataCarrier: 35, Grpc1: 60, Prepare: 60, Pool2-Prom: 17)
- OAP throughput (per pod): L1 aggregation 0.6M/min, L2 aggregation 0.2M/min, trace analysis 17k/min, persistence 8.8k/5min (85% preparing, 15% execution; preparing p99 250ms, execution p99 10s)
- ES spec: 6 nodes x 4C/8GB RAM/150GB disk, 1 replica, 7500 IOPS
- ES utilization: CPU 44%, write QPS 18k (p99), RAM 59%
- Summary: Throughput appears to drop, suggesting a bottleneck on the application-system side. Action: deploy the Java agent onto more microservices and check where the bottleneck in the application system is (see the agent sketch below).
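Round 17's follow-up is to instrument more microservices with the SkyWalking Java agent. A minimal sketch of one instrumented container, assuming the agent files are mounted at /skywalking/agent and that the OAP gRPC service listens on the default port 11800 (the service and container names are placeholders):

```yaml
# Application Deployment fragment (illustrative)
containers:
  - name: example-service
    env:
      - name: JAVA_TOOL_OPTIONS                     # JVM loads the agent at startup
        value: "-javaagent:/skywalking/agent/skywalking-agent.jar"
      - name: SW_AGENT_NAME                         # service name shown in SkyWalking
        value: "example-service"
      - name: SW_AGENT_COLLECTOR_BACKEND_SERVICES   # OAP gRPC address
        value: "oap.skywalking.svc:11800"
```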