Before starting, I want to state up front that I think I am doing something wrong, and I would very much appreciate suggestions and help in identifying my mistakes.
1 Google Cloud Platform VM instance of type n1-standard-8 (8 vCPUs + 30 GB RAM) with a local SSD attached.
I wanted to set up just a single node and run a basic benchmark before going all in on setting up many nodes.
- Maybe this is a mistake?
- Should I be running multiple nodes?
- Is 1 PD + 1 TiKV not enough for just using the Raw KV API?
Set up just 1 PD and 1 TiKV in Docker containers.
sudo docker run -d --name pd1 \
-p 2379:2379 \
-p 2380:2380 \
-v /etc/localtime:/etc/localtime:ro \
-v /mnt/disks/localssd/data:/data \
pingcap/pd:latest \
--name="pd1" \
--data-dir="/data/pd1" \
--client-urls="http://0.0.0.0:2379" \
--advertise-client-urls="http://benchmark-tikv-server:2379" \
--peer-urls="http://0.0.0.0:2380" \
--advertise-peer-urls="http://benchmark-tikv-server:2380" \
--initial-cluster="pd1=http://benchmark-tikv-server:2380"
sudo docker run -d --name tikv1 \
-p 20160:20160 \
-v /etc/localtime:/etc/localtime:ro \
-v /mnt/disks/localssd/data:/data \
pingcap/tikv:latest \
--addr="0.0.0.0:20160" \
--advertise-addr="10.132.15.225:20160" \
--data-dir="/data/tikv1" \
--pd="benchmark-tikv-server:2379"
/data is mounted to a path on a local SSD on the host machine.
A YCSB fork using the Java TiKV client (available at https://github.com/scriptnull/YCSB/tree/tikv ). It uses the Raw KV API to get and set data.
When doing 1 million inserts, the following YCSB results were observed.
[OVERALL], RunTime(ms), 896967
[OVERALL], Throughput(ops/sec), 1114.8682170024092
[TOTAL_GCS_G1_Young_Generation], Count, 10
[TOTAL_GC_TIME_G1_Young_Generation], Time(ms), 293
[TOTAL_GC_TIME_%_G1_Young_Generation], Time(%), 0.03266563875817059
[TOTAL_GCS_G1_Old_Generation], Count, 0
[TOTAL_GC_TIME_G1_Old_Generation], Time(ms), 0
[TOTAL_GC_TIME_%_G1_Old_Generation], Time(%), 0.0
[TOTAL_GCs], Count, 10
[TOTAL_GC_TIME], Time(ms), 293
[TOTAL_GC_TIME_%], Time(%), 0.03266563875817059
[CLEANUP], Operations, 20
[CLEANUP], AverageLatency(us), 2.2
[CLEANUP], MinLatency(us), 1
[CLEANUP], MaxLatency(us), 7
[CLEANUP], 95thPercentileLatency(us), 2
[CLEANUP], 99thPercentileLatency(us), 7
[INSERT], Operations, 1000000
[INSERT], AverageLatency(us), 17890.24202
[INSERT], MinLatency(us), 4816
[INSERT], MaxLatency(us), 565247
[INSERT], 95thPercentileLatency(us), 24687
[INSERT], 99thPercentileLatency(us), 28735
[INSERT], Return=OK, 1000000
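As a quick sanity check on these numbers, the reported throughput is just operations divided by runtime, using the figures above:

```shell
# 1,000,000 inserts over 896,967 ms of runtime => ops/sec
awk 'BEGIN { printf "%.2f ops/sec\n", 1000000 * 1000 / 896967 }'
# prints 1114.87 ops/sec, matching the [OVERALL] Throughput line
```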
When doing 1 million operations with a 50/50 read + update mix, the following results were obtained.
[OVERALL], RunTime(ms), 466074
[OVERALL], Throughput(ops/sec), 2145.5820320378307
[TOTAL_GCS_G1_Young_Generation], Count, 10
[TOTAL_GC_TIME_G1_Young_Generation], Time(ms), 251
[TOTAL_GC_TIME_%_G1_Young_Generation], Time(%), 0.05385410900414955
[TOTAL_GCS_G1_Old_Generation], Count, 0
[TOTAL_GC_TIME_G1_Old_Generation], Time(ms), 0
[TOTAL_GC_TIME_%_G1_Old_Generation], Time(%), 0.0
[TOTAL_GCs], Count, 10
[TOTAL_GC_TIME], Time(ms), 251
[TOTAL_GC_TIME_%], Time(%), 0.05385410900414955
[READ], Operations, 499216
[READ], AverageLatency(us), 706.6268669273421
[READ], MinLatency(us), 376
[READ], MaxLatency(us), 305663
[READ], 95thPercentileLatency(us), 963
[READ], 99thPercentileLatency(us), 1145
[READ], Return=OK, 499216
[CLEANUP], Operations, 20
[CLEANUP], AverageLatency(us), 2.4
[CLEANUP], MinLatency(us), 1
[CLEANUP], MaxLatency(us), 7
[CLEANUP], 95thPercentileLatency(us), 6
[CLEANUP], 99thPercentileLatency(us), 7
[UPDATE], Operations, 500784
[UPDATE], AverageLatency(us), 17746.126349883383
[UPDATE], MinLatency(us), 4352
[UPDATE], MaxLatency(us), 295167
[UPDATE], 95thPercentileLatency(us), 23647
[UPDATE], 99thPercentileLatency(us), 27807
[UPDATE], Return=OK, 500784
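One thing worth noting from both runs: multiplying throughput by average latency gives the effective client concurrency, and in both workloads it comes out at roughly 20, which matches the 20 CLEANUP operations (presumably one per YCSB client thread). If that reading is right, the client is latency-bound at ~18 ms per write, so the disk never has enough outstanding work queued to show high %util. A quick check with the numbers above:

```shell
# Effective concurrency = (ops * avg_latency_us) / (runtime_ms * 1000), per run
awk 'BEGIN {
  # load run: 1,000,000 inserts at 17890.24 us avg over 896,967 ms
  printf "load run:  ~%.0f concurrent ops\n", (1000000 * 17890.24) / (896967 * 1000)
  # mixed run: 499,216 reads + 500,784 updates over 466,074 ms
  printf "mixed run: ~%.0f concurrent ops\n", (499216 * 706.63 + 500784 * 17746.13) / (466074 * 1000)
}'
# both print ~20 concurrent ops
```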
Had a tmux session with htop + iostat -x 1 running during the benchmarking process.
- Is it normal for CPU usage to be this low?
- RAM usage is also low. I tried increasing the block cache size to 10 GB, but usage stays low: only about 1 GB of it gets used.
- %util in iostat -x 1 is way too low: it flickers between 0.0 and 0.40, whereas for other key-value stores I was able to achieve 90+%. Will increasing this lead to better throughput?
- How do we tune TiKV to increase this factor?
- Will decreasing w_await in iostat -x 1 cause more utilization and lead to better throughput? If so, what tuning in TiKV can help with this?
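On watching those two columns specifically: a small sketch that pulls w_await and %util for a single device out of `iostat -x` output by header name, since the column positions differ between sysstat versions. The device name `sdb` is an assumption here; substitute the local SSD's device name from `lsblk`.

```shell
# Usage: iostat -x 1 | awk -f this_script.awk
# Device name "sdb" is assumed; replace with the local SSD device.
iostat -x 1 | awk '
  $1 == "Device" || $1 == "Device:" {        # header row: locate columns by name
    for (i = 1; i <= NF; i++) {
      if ($i == "w_await") wcol = i
      if ($i == "%util")   ucol = i
    }
  }
  $1 == "sdb" && wcol && ucol {              # one data row per iostat interval
    print "w_await=" $wcol, "%util=" $ucol
  }
'
```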
Please confirm that the block-cache capacity is set correctly; it should be [storage.block-cache] capacity = "10GB".
Please check the dataset size: is it too small?
Please check whether the bottleneck is the YCSB client machine.
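On the block-cache suggestion above, a hedged sketch of the config fragment, assuming TiKV's TOML layout for this setting (the host file path is an assumption):

```toml
# hypothetical tikv.toml, e.g. at /mnt/disks/localssd/tikv.toml on the host
[storage.block-cache]
capacity = "10GB"
```

It could then be mounted into the container (e.g. adding `-v /mnt/disks/localssd/tikv.toml:/tikv.toml:ro` to the `docker run` command shown earlier) and passed to tikv-server with `--config=/tikv.toml` alongside the other flags.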