Before starting, I want to state up front that I think I am doing something wrong, and I would very much appreciate suggestions and help in identifying my mistakes.
1 Google Cloud Platform VM instance of type n1-standard-8 (8 vCPUs + 30 GB RAM) with a local SSD attached.
I wanted to set up just a single node and run a basic benchmark before going all in on setting up many nodes.
- Maybe this is a mistake?
- Should I be running multiple nodes?
- Is 1 PD + 1 TiKV not enough for just using the Raw KV API?
Set up just 1 PD and 1 TiKV in Docker containers.
sudo docker run -d --name pd1 \
-p 2379:2379 \
-p 2380:2380 \
-v /etc/localtime:/etc/localtime:ro \
-v /mnt/disks/localssd/data:/data \
pingcap/pd:latest \
--name="pd1" \
--data-dir="/data/pd1" \
--client-urls="http://0.0.0.0:2379" \
--advertise-client-urls="http://benchmark-tikv-server:2379" \
--peer-urls="http://0.0.0.0:2380" \
--advertise-peer-urls="http://benchmark-tikv-server:2380" \
--initial-cluster="pd1=http://benchmark-tikv-server:2380"
sudo docker run -d --name tikv1 \
-p 20160:20160 \
-v /etc/localtime:/etc/localtime:ro \
-v /mnt/disks/localssd/data:/data \
pingcap/tikv:latest \
--addr="0.0.0.0:20160" \
--advertise-addr="10.132.15.225:20160" \
--data-dir="/data/tikv1" \
--pd="benchmark-tikv-server:2379"
/data is mounted to a path on a local SSD on the host machine.
A YCSB fork using the Java TiKV client (available at https://github.com/scriptnull/YCSB/tree/tikv ). It uses the Raw KV API to get and set data.
When doing 1 million inserts, the following YCSB results were observed.
[OVERALL], RunTime(ms), 896967
[OVERALL], Throughput(ops/sec), 1114.8682170024092
[TOTAL_GCS_G1_Young_Generation], Count, 10
[TOTAL_GC_TIME_G1_Young_Generation], Time(ms), 293
[TOTAL_GC_TIME_%_G1_Young_Generation], Time(%), 0.03266563875817059
[TOTAL_GCS_G1_Old_Generation], Count, 0
[TOTAL_GC_TIME_G1_Old_Generation], Time(ms), 0
[TOTAL_GC_TIME_%_G1_Old_Generation], Time(%), 0.0
[TOTAL_GCs], Count, 10
[TOTAL_GC_TIME], Time(ms), 293
[TOTAL_GC_TIME_%], Time(%), 0.03266563875817059
[CLEANUP], Operations, 20
[CLEANUP], AverageLatency(us), 2.2
[CLEANUP], MinLatency(us), 1
[CLEANUP], MaxLatency(us), 7
[CLEANUP], 95thPercentileLatency(us), 2
[CLEANUP], 99thPercentileLatency(us), 7
[INSERT], Operations, 1000000
[INSERT], AverageLatency(us), 17890.24202
[INSERT], MinLatency(us), 4816
[INSERT], MaxLatency(us), 565247
[INSERT], 95thPercentileLatency(us), 24687
[INSERT], 99thPercentileLatency(us), 28735
[INSERT], Return=OK, 1000000
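As a quick sanity check on these numbers, the reported throughput is just operations divided by runtime, using the figures above:

```shell
# 1,000,000 inserts over 896,967 ms of runtime => ops/sec
awk 'BEGIN { printf "%.2f ops/sec\n", 1000000 * 1000 / 896967 }'
# prints 1114.87 ops/sec, matching the [OVERALL] Throughput line
```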
When doing 1 million operations with a 50/50 read + update mix, the following results were obtained.
[OVERALL], RunTime(ms), 466074
[OVERALL], Throughput(ops/sec), 2145.5820320378307
[TOTAL_GCS_G1_Young_Generation], Count, 10
[TOTAL_GC_TIME_G1_Young_Generation], Time(ms), 251
[TOTAL_GC_TIME_%_G1_Young_Generation], Time(%), 0.05385410900414955
[TOTAL_GCS_G1_Old_Generation], Count, 0
[TOTAL_GC_TIME_G1_Old_Generation], Time(ms), 0
[TOTAL_GC_TIME_%_G1_Old_Generation], Time(%), 0.0
[TOTAL_GCs], Count, 10
[TOTAL_GC_TIME], Time(ms), 251
[TOTAL_GC_TIME_%], Time(%), 0.05385410900414955
[READ], Operations, 499216
[READ], AverageLatency(us), 706.6268669273421
[READ], MinLatency(us), 376
[READ], MaxLatency(us), 305663
[READ], 95thPercentileLatency(us), 963
[READ], 99thPercentileLatency(us), 1145
[READ], Return=OK, 499216
[CLEANUP], Operations, 20
[CLEANUP], AverageLatency(us), 2.4
[CLEANUP], MinLatency(us), 1
[CLEANUP], MaxLatency(us), 7
[CLEANUP], 95thPercentileLatency(us), 6
[CLEANUP], 99thPercentileLatency(us), 7
[UPDATE], Operations, 500784
[UPDATE], AverageLatency(us), 17746.126349883383
[UPDATE], MinLatency(us), 4352
[UPDATE], MaxLatency(us), 295167
[UPDATE], 95thPercentileLatency(us), 23647
[UPDATE], 99thPercentileLatency(us), 27807
[UPDATE], Return=OK, 500784
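One thing worth noting from both runs: multiplying throughput by average latency gives the effective client concurrency, and in both workloads it comes out at roughly 20, which matches the 20 CLEANUP operations (presumably one per YCSB client thread). If that reading is right, the client is latency-bound at ~18 ms per write, so the disk never has enough outstanding work queued to show high %util. A quick check with the numbers above:

```shell
# Effective concurrency = (ops * avg_latency_us) / (runtime_ms * 1000), per run
awk 'BEGIN {
  # load run: 1,000,000 inserts at 17890.24 us avg over 896,967 ms
  printf "load run:  ~%.0f concurrent ops\n", (1000000 * 17890.24) / (896967 * 1000)
  # mixed run: 499,216 reads + 500,784 updates over 466,074 ms
  printf "mixed run: ~%.0f concurrent ops\n", (499216 * 706.63 + 500784 * 17746.13) / (466074 * 1000)
}'
# both print ~20 concurrent ops
```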
Had a tmux session with htop + iostat -x 1 running during the benchmarking process.
- Is it normal for CPU usage to be this low?
- RAM usage is also low. I tried increasing the block cache size to 10 GB, but usage stays low: only about 1 GB of it gets used.
- %util in iostat -x 1 is way too low: it flickers between 0.0 and 0.40, whereas for other key-value stores I was able to achieve 90+%. Will increasing this lead to better throughput?
- How do we tune TiKV to increase this factor?
- Will decreasing w_await in iostat -x 1 cause more utilization and lead to better throughput? If so, what tuning in TiKV can help with this?
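On watching those two columns specifically: a small sketch that pulls w_await and %util for a single device out of `iostat -x` output by header name, since the column positions differ between sysstat versions. The device name `sdb` is an assumption here; substitute the local SSD's device name from `lsblk`.

```shell
# Usage: iostat -x 1 | awk -f this_script.awk
# Device name "sdb" is assumed; replace with the local SSD device.
iostat -x 1 | awk '
  $1 == "Device" || $1 == "Device:" {        # header row: locate columns by name
    for (i = 1; i <= NF; i++) {
      if ($i == "w_await") wcol = i
      if ($i == "%util")   ucol = i
    }
  }
  $1 == "sdb" && wcol && ucol {              # one data row per iostat interval
    print "w_await=" $wcol, "%util=" $ucol
  }
'
```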
Please confirm that the block-cache capacity is set correctly; it should be [storage.block-cache] capacity = "10GB".
Please check the dataset size: is it too small?
Please check whether the bottleneck is the YCSB client machine.
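On the block-cache suggestion above, a hedged sketch of the config fragment, assuming TiKV's TOML layout for this setting (the host file path is an assumption):

```toml
# hypothetical tikv.toml, e.g. at /mnt/disks/localssd/tikv.toml on the host
[storage.block-cache]
capacity = "10GB"
```

It could then be mounted into the container (e.g. adding `-v /mnt/disks/localssd/tikv.toml:/tikv.toml:ro` to the `docker run` command shown earlier) and passed to tikv-server with `--config=/tikv.toml` alongside the other flags.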