Round | Test Window | Load (Finance Core System) | OAP HW Specs | OAP Core gRPC Threads | OAP Core Prepare Threads | ES Concurrent Requests | OAP CPU Utilization | OAP RAM Utilization | OAP Thread Pool Utilization | OAP L1 Aggregation /Min | OAP L2 Aggregation /Min | OAP Trace Analysis /Min | OAP Persistence /5Min | Persistence Preparing (99%) | Persistence Execution (99%) | ES CPU Utilization | ES Write QPS (99%) | ES Disk Utilization | ES RAM Utilization | ES (Elasticsearch) Specs | OAP & ES Performance Summary & Tuning Steps |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0429 17:30 - 18:30 | 5 CCU, 1 Hour | 4GHz CPU, 4GB RAM | 16 | 16 | 2 | 1GHz | 2GB | 220 | 0.5M | 0.2M | 11k | 11k | 99% 10ms | 99% 10s | 70% CPU | 10k | 60% - 80% | 70% | 3 nodes x 2C/4GB RAM/20GB Disk, 1 replica, 500 IOPS | Passed |
2 | 0430 11:00 - 15:00 | 5 CCU, 10 Hours | 4GHz CPU, 4GB RAM | 16 | 16 | 2 | 1.8GHz | 3.2GB | 240 | 0.75M | 0.3M | 17k | 11k | 99% 100ms | 99% 10s | 80% CPU | 10k | 65% - 100% | 70% | 3 nodes x 2C/4GB RAM/20GB Disk, 1 replica, 500 IOPS | No input data after 15:00 (4 hours in); no apparent impact on the business system. Root cause: 1GB consumed per CCU/hour, so the 20GB disk filled within 4 hours. Action: scale out ES disk from 20GB to 100GB per node, and scale out OAP CPU from 4GHz to 6GHz. |
3 | 0430 22:00 - 24:00 | 10 CCU, 2 Hours | 6GHz CPU, 8GB RAM | 16 | 16 | 4 | 4.8GHz | 4GB | 350 (eventloop5: 1, eventloop4: 200, DataCarrier: 6, Grpc: 16, Prepare: 2, Pool2-Prom: 11) | 1.5M | 0.5M | 35k | 11k (85% Prep, 15% Exec) | 99% 100ms | 99% 10s | 90% CPU | 12.5k | 25% - 40% | 70% | 3 nodes x 2C/4GB RAM/100GB Disk, 1 replica, 2500 IOPS | Passed |
4 | 0502 10:00 - 14:00 | 10 CCU, 4 Hours | 6GHz CPU, 8GB RAM | 16 | 16 | 4 | 4.8GHz | 4GB | 152 - 340 (eventloop5: 2, eventloop4: 15, DataCarrier: 26, Grpc: 16, Prepare: 2, Pool2-Prom: 11) | 1M | 0.5M | 40k | 11k (85% Prep, 15% Exec) | 99% 100ms | 99% 10s | 70% CPU | 11.5k | 36% - 72% | 70% | 3 nodes x 2C/4GB RAM/100GB Disk, 1 replica, 2500 IOPS | Warning: gRPC server thread pool full after 2 hours: org.apache.skywalking.oap.server.library.server.grpc.GRPCServer 115 [grpc-default-worker-ELG-7-4] WARN [] - Grpc server thread pool is full, rejecting the task. Action 1: to avoid filling the disk, activate housekeeping with a retention policy of 2 days for records data and 21 days for metrics data. Action 2: increase gRPC threads from 16 to 32. |
5 | 0502 21:15 - 24:00 | 20 CCU, 2 Hours | 6GHz CPU, 8GB RAM | 32 | 32 | 4 | 4.8GHz | 2.5GB | 350 | 1.5M | 0.5M | 40k | 11k | 99% 100ms | 99% 10s | 60% | 8.8k | 47% - 70% | 70% | 3 nodes x 2C/4GB RAM/100GB Disk, 1 replica, 2500 IOPS | Warning: "es_rejected_execution_exception", rejected execution of coordinating operation. Caused by: java.lang.RuntimeException: {"error":{"root_cause":[{"type":"es_rejected_execution_exception","reason":"rejected execution of coordinating operation [coordinating_and_primary_bytes=142150483, replica_bytes=5063426, all_bytes=147213909, coordinating_operation_bytes=8817157, max_coordinating_and_primary_bytes=148655308]"}],"status":429}. Action: increase ES nodes to 4C/4GB RAM. |
5 (retest) | 0503 10:00 - 12:00 | Retest after the thread-pool-full warning in the previous run; OAP restarted at 10:00 | 6GHz CPU, 8GB RAM | 32 | 32 | 4 | 7.2GHz | 3.5GB | 200 | 2.6M | 0.6M | 75k | 4k | 99% 250ms | 100% 10s | 7% | 0 | 62% | 70% | 3 nodes x 2C/4GB RAM/100GB Disk, 1 replica, 2500 IOPS | Still hit the same warning as above. Action: increase OAP from 1 node to 2 nodes x 5GHz/5GB RAM, 20 gRPC threads per node. |
6 | 0503 22:00 - 26:00 | 20 CCU, 4 Hours | 5GHz CPU, 5GB RAM x 2 pods | 20 x 2 pods | 20 x 2 pods | 4 x 2 | 4.8GHz x 2 | 2.5GB x 2 | 120 (x 2) (eventloop5: 5, eventloop4: 4, DataCarrier: 22, Grpc: 20, Prepare: 2, Pool2-Prom: 11) | 1.5M x 2 | 0.3M x 2 | 40k x 2 | 9k x 2 | 99% 100ms | 99% 10s | 75% CPU | 8.8k | N/A | 70% | 3 nodes x 4C/4GB RAM/100GB Disk, 1 replica, 2500 IOPS | Warning: "es_rejected_execution_exception", rejected execution of coordinating operation. Caused by: java.lang.RuntimeException: {"error":{"root_cause":[{"type":"es_rejected_execution_exception","reason":"rejected execution of coordinating operation [coordinating_and_primary_bytes=142150483, replica_bytes=5063426, all_bytes=147213909, coordinating_operation_bytes=8817157, max_coordinating_and_primary_bytes=148655308]"}],"status":429} |
7 | 0504 09:00 - 13:00 | 20 CCU, 4 Hours | 5GHz CPU, 5GB RAM x 2 pods | 20 x 2 pods | 9 x 2 pods | 4 x 2 | N/A | N/A | 140 (x 2) (eventloop5: 5, eventloop4: 12, DataCarrier: 22, Grpc: 40, Prepare: 2, Pool2-Prom: 11) | N/A | N/A | N/A | N/A | N/A | N/A | 50% | 15k | N/A | 70% | 3 nodes x 4C/8GB RAM/100GB Disk, 1 replica, 2500 IOPS | Warning: thread pool full. Action 1: increase gRPC thread pool size to 40 and pool queue to 20000. Warning again: thread pool full. Action 2: increase OAP CPU from 5GHz to 8GHz, and ES disk from 100GB to 150GB. |
8 | 0504 15:00 - 17:00 | 20 CCU, 4 Hours | 8GHz CPU, 5GB RAM x 2 pods | 40 (20k queue) x 2 pods | 9 x 2 pods | 4 x 2 | N/A | N/A | 170 (x 2) (eventloop5: 8, eventloop4: 7, DataCarrier: 35, Grpc: 40, Prepare: 2, Pool2-Prom: 17) | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 3 nodes x 4C/8GB RAM/150GB Disk, 1 replica, 3750 IOPS | Warning again: thread pool full. Action: increase OAP gRPC threads/prepare threads and ES concurrent requests to 40. |
9 | 0504 19:40 - 23:40 | 20 CCU, 4 Hours | 8GHz CPU, 8GB RAM x 2 pods | 40 (20k queue) x 2 pods | 40 x 2 pods | 4 x 2 | 7GHz x 2 | 5.5GB x 2 | 230 (x 2) (eventloop5: 8, eventloop4: 18, DataCarrier: 35, Grpc: 40, Prepare: 40, Pool2-Prom: 17) | 1.6M x 2 | 0.4M x 2 | 40k x 2 | 10k x 2 | 99% 500ms | 99% 5s | 75% | 15k | 39% - ? | 70% | 3 nodes x 4C/8GB RAM/150GB Disk, 1 replica, 3750 IOPS | Warning: ES bulk post timeout, over 15 seconds. Action 1: increase ES flush interval from 10s to 30s and bulk actions from 5000 to 15000. Warning again: ES bulk post timeout, over 15 seconds. Action 2: reduce flush interval to 8s and bulk actions to 3000. Action 3: reduce ES concurrent requests to 4. |
10 | 0505 19:40 - 23:40 | 20 CCU, 4 Hours | 8GHz CPU, 8GB RAM x 2 pods | 40 (20k queue) x 2 pods | 40 x 2 pods | 4 x 2 | 90% | 6GB x 2 | N/A | 1.6M x 2 | 0.4M x 2 | 35k x 2 | 11k x 2 | 99% 700ms | 99% 10s | 50% | 12.5k | 12% | 55% | 3 nodes x 4C/8GB RAM/150GB Disk, 1 replica, 3750 IOPS | Action: housekept 250GB (2 days of log indices) via Kibana; changed the ES node type to data-analysis mode. Warning: thread pool full on 1 node. Action: increase ES from 3 nodes to 6 nodes and 6 shards, to raise the max IOPS capacity. |
11 | 0505 21:15 - 23:40 | 20 CCU, 4 Hours | 8GHz CPU, 8GB RAM x 2 pods | 40 (20k queue) x 2 pods | 40 x 2 pods | 4 x 2 | 85% | 6.5GB x 2 | N/A | 1.4M x 2 | 0.3M x 2 | 33k x 2 | 11k x 2 | 99% 500ms | 99% 1s - 10s (fluctuating) | 40% | 6 - 14.5k | 7% - 16% | 55% | 6 nodes x 4C/8GB RAM/150GB Disk, 1 replica, 7500 IOPS | Warning: thread pool full on 1 node. Action 1: increase gRPC thread pool size to 60 and queue to 40000. Warning again: thread pool full on 1 node. Action 2: extend bulk post from every 8s to every 60s, and from 3000 to 20000 actions. |
12 | 0506 07:45 - 11:45 | 20 CCU, 4 Hours | 8GHz CPU, 8GB RAM x 2 pods | 60 (30k queue) x 2 pods | 60 x 2 pods | 4 x 2 | 90% | 6GB x 2 | N/A | 1.8M x 2 | 0.5M x 2 | 40k x 2 | 11k x 2 (85% Prep, 15% Exec) | 99% 250ms | 99% 10s | 40% | 11k | 15% - 26% | 55% | 6 nodes x 4C/8GB RAM/150GB Disk, 1 replica, 7500 IOPS | Passed |
13 | 0506 23:00 - 24:00 | 40 CCU, 1 Hour | 8GHz CPU, 8GB RAM x 2 pods | 60 (30k queue) x 2 pods | 60 x 2 pods | 4 x 2 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 6 nodes x 4C/8GB RAM/150GB Disk, 1 replica, 7500 IOPS | Passed, but with WARN: com.linecorp.armeria.common.ContentTooLargeException: maxContentLength: 10485760, contentLength: 10572394, transferred: 10491988. Action: content length tuning pending; only a small part of the data is dropped, and everything else keeps working. |
14 | 0507 09:20 - 11:20 | 50 CCU, 2 Hours | 8GHz CPU, 8GB RAM x 2 pods | 60 (30k queue) x 2 pods | 60 x 2 pods | 4 x 2 | 80% | 5.5GB x 2 | 300 (x 2) (eventloop5: 6, eventloop4: 12, DataCarrier: 35, Grpc1: 60, Prepare: 60, Prom: 17) | 1.5M x 2 | 0.3M x 2 | 40k x 2 | 9k x 2 (85% Prep, 15% Exec) | 99% 250ms | 99% 10s | 48% | 25k | 29% - 41% | 49% | 6 nodes x 4C/8GB RAM/150GB Disk, 1 replica, 7500 IOPS | Warning: thread pool full on 1 node, 30 minutes in. Action: extend bulk post from every 60s to every 90s, and from 20000 to 30000 actions. |
15 | 0507 13:45 - 15:45 | 50 CCU, 2 Hours | 8GHz CPU, 8GB RAM x 2 pods | 60 (30k queue) x 2 pods | 60 x 2 pods | 4 x 2 | 100% | 3.3GB x 2 | 300 (x 2) (eventloop5: 6, eventloop4: 12, DataCarrier: 35, Grpc1: 60, Prepare: 60, Pool2-Prom: 17) | 2M x 2 | 0.4M x 2 | 52k x 2 | 8k x 2 (85% Prep, 15% Exec) | 99% 500ms | 99% 10s | 48% | 25k | 29% - 41% | 49% | 6 nodes x 4C/8GB RAM/150GB Disk, 1 replica, 7500 IOPS | Warning 1: thread pool full on 1 node, 60 minutes in. Warning 2: OAP CPU utilization reached 100% and triggered a pod restart via the K8s health check. Warning 3: consistently higher CPU utilization on one particular ES node. Action 1: extend bulk post from every 90s to every 180s, and from 30000 to 60000 actions. Action 2: increase OAP pods from 2 to 3. Action 3: checked with Kibana and found the index shards still at default values (5 shards for LOG/SEGMENT, 1 shard for metrics); the root cause was the misspelled env parameter SW_STORAGE_ES_INDEX_SHARDS_NUMBE on the OAP pod, corrected to SW_STORAGE_ES_INDEX_SHARDS_NUMBER (6 shards). |
16 | 0507 16:50 - 18:50 | 50 CCU, 2 Hours | 8GHz CPU, 8GB RAM x 3 pods | 60 (30k queue) x 3 pods | 60 x 3 pods | 4 x 3 | 42% | 3.5GB x 3 | 270 (x 3) (eventloop5: 8, eventloop4: 27, DataCarrier: 35, Grpc1: 60, Prepare: 60, Pool2-Prom: 17) | 0.6M x 3 | 0.2M x 3 | 18k x 3 | 9k x 3 (85% Prep, 15% Exec) | 99% 250ms | 99% 10s | 50% | 20k | 62% - ? | 52% | 6 nodes x 4C/8GB RAM/150GB Disk, 1 replica, 7500 IOPS | Passed. No exact end-time disk size because the housekeeping job ran during the test window. With WARN: com.linecorp.armeria.common.ContentTooLargeException: maxContentLength: 10485760, contentLength: 10572394, transferred: 10491988. Action: content length tuning pending; only a small part of the data is dropped, and everything else keeps working. |
17 | 0507 19:30 - 20:30 | 100 CCU, 1 Hour | 8GHz CPU, 8GB RAM x 3 pods | 60 (30k queue) x 3 pods | 60 x 3 pods | 4 x 3 | 43% | 3.8GB x 3 | 270 (x 3) (eventloop5: 8, eventloop4: 27, DataCarrier: 35, Grpc1: 60, Prepare: 60, Pool2-Prom: 17) | 0.6M x 3 | 0.2M x 3 | 17k x 3 | 8.8k x 3 (85% Prep, 15% Exec) | 99% 250ms | 99% 10s | 44% | 18k | 59% | N/A | 6 nodes x 4C/8GB RAM/150GB Disk, 1 replica, 7500 IOPS | Throughput appears capped by a bottleneck on the application system side. Action: deploy the javaagent onto more microservices and identify the bottleneck in the application system. |
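For reference, the parameter set the table converges on by rounds 16-17 can be expressed as OAP environment variables. This is a minimal sketch assuming SkyWalking 8.x env var names with the elasticsearch7 storage backend; the values come straight from the actions above, and the queue size follows the "30k queue" recorded in rounds 12+ rather than the 40000 briefly mentioned in round 11.

```sh
# Final OAP tuning values reached by rounds 16-17 (assuming SkyWalking 8.x env names).
# gRPC server thread pool (rounds 4/7/11: 16 -> 32 -> 40 -> 60; queue 20k -> 30k).
export SW_CORE_GRPC_THREAD_POOL_SIZE=60
export SW_CORE_GRPC_THREAD_POOL_QUEUE_SIZE=30000
# Housekeeping/TTL from round 4: 2 days for records, 21 days for metrics (unit: days).
export SW_CORE_RECORD_DATA_TTL=2
export SW_CORE_METRICS_DATA_TTL=21
# ES bulk processor (rounds 9/11/14/15: settled at 60000 actions flushed every 180s).
export SW_STORAGE_ES_BULK_ACTIONS=60000
export SW_STORAGE_ES_FLUSH_INTERVAL=180
export SW_STORAGE_ES_CONCURRENT_REQUESTS=4
# Index layout (round 15: a misspelled variable name silently left the default shards).
export SW_STORAGE_ES_INDEX_SHARDS_NUMBER=6
export SW_STORAGE_ES_INDEX_REPLICAS_NUMBER=1
```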
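Round 15's root cause (the misspelled SW_STORAGE_ES_INDEX_SHARDS_NUMBE) is a good argument for verifying the effective shard layout directly in Elasticsearch rather than trusting the OAP config. A quick check against the _cat API might look like the following, where es-host is a placeholder:

```sh
# List indices with their primary/replica shard counts and size.
# A "pri" value stuck at the ES default means the OAP shard setting never took effect.
curl -s 'http://es-host:9200/_cat/indices?v&h=index,pri,rep,store.size'
```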
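The es_rejected_execution_exception (HTTP 429) in rounds 5-6 is Elasticsearch indexing pressure: in ES 7.9+, max_coordinating_and_primary_bytes defaults to 10% of the JVM heap (the indexing_pressure.memory.limit setting), so the cited ~148MB limit suggests a heap of roughly 1.4GB on the 4GB-RAM nodes, and raising node RAM (and heap) raises that ceiling. Write-path rejections can be watched during a run with the _cat API (es-host again a placeholder):

```sh
# Per-node write thread pool stats; a growing "rejected" counter indicates
# ES cannot keep up with the OAP bulk writes.
curl -s 'http://es-host:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected'
```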