Low-latency guests in KVM

Summary

Obtaining low-latency guests in KVM (i.e.: low DPC latency for Windows guests) can be difficult. Without it, you may hear cracks/pops in audio or see freezes in the VM, which can be very annoying, especially for gaming-dedicated VMs.

This document summarizes some of my findings on this subject.

Configuring KVM for real-time workloads

As described here, KVM is capable of running real-time workloads, but most of the time the default kernel configuration is not tuned for them. So we need to make some adjustments to both the kernel and the KVM/QEMU configuration to get the lowest latency possible.

The most important part is to make sure that the VM gets some dedicated CPU cores. This means preventing the kernel from scheduling any tasks on the CPUs running the VM, and also manually pinning the vCPUs to real CPUs so we avoid context switches.
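
Before deciding which CPUs to dedicate, it helps to look at how the host CPUs are laid out (sockets, cores, hyper-threading siblings, NUMA nodes). A quick, read-only way to do that is lscpu:

$ lscpu -e   # lists each logical CPU with its node, socket and core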

For the QEMU/KVM part this is easy. For this example I am assuming a CPU with 6 cores: 2 shared with the host, 3 allocated to the vCPUs (the reason I am allocating 3 and not 4 will become clear later) and 1 for VM emulation:

<vcpu placement='static'>3</vcpu>
<!-- ... -->
<cputune>
  <emulatorpin cpuset="2"/>
  <vcpupin vcpu="0" cpuset="3"/>
  <vcpupin vcpu="1" cpuset="4"/>
  <vcpupin vcpu="2" cpuset="5"/>
  <vcpusched vcpus='0-2' scheduler='fifo' priority='1'/>
</cputune>
<!-- ... -->
<cpu mode='host-passthrough'>
  <topology sockets='1' dies='1' cores='3' threads='1'/>
  <feature policy='require' name='tsc-deadline'/>
</cpu>
<!-- ... -->
<features>
  <!-- ... -->
  <pmu state='off'/>
</features>

So, what is happening above? We are pinning each vCPU to a specific real CPU and setting the scheduler to a real-time one (fifo). Using fifo + cpuset makes the core essentially dedicated to that vCPU, at least on the VM side. The remaining changes are probably not required, but they're recommended in this article.

We also do the same for the core responsible for the emulation tasks. While this core generally doesn't have a big CPU load, it also seems highly sensitive to latency, which is why we pin it to a dedicated core as well.
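
To check that libvirt actually applied the pinning and the FIFO policy, something like the commands below can help (just a sketch; "win10" is a placeholder domain name, and the ps field names assume a procps-style ps):

$ virsh vcpupin win10        # shows each vCPU and its CPU affinity
$ virsh emulatorpin win10    # shows the emulator thread affinity
# QEMU names its vCPU threads "CPU n/KVM"; cls=FF means SCHED_FIFO,
# psr is the CPU the thread is currently assigned to
$ ps -eLo tid,cls,rtprio,psr,comm | grep 'CPU.*/KVM'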

Now that our vCPU/emulation threads are each pinned to a separate real CPU, we need to isolate the other tasks (i.e.: host tasks) from them. This can be done using either isolcpus or cpuset1. I find the latter more flexible, since you can turn it on/off without rebooting, so this is my choice:

$ sudo cset shield --cpu 2-5 --kthread on
# If you want to remove the shield
$ sudo cset shield --reset

This will prevent any user tasks and most kernel tasks from running on those cores. Note that it is also important that the emulation core is isolated; if it is not, mysterious latency spikes will happen.

1: You may need this patch to make cpuset work with a running VM: https://rokups.github.io/#!pages/gaming-vm-performance.md#Update_1:_cpuset_patch
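
To double check that the shield is really in place, one option is to list what is still running on the supposedly dedicated CPUs (a sketch using plain ps; psr is the CPU a thread is currently assigned to, so after starting the VM you should mostly see the QEMU/KVM threads there):

$ ps -eLo tid,psr,comm | awk '$2 >= 2 && $2 <= 5'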

Tuning

The above will get you 95% of the way there, but latency spikes will still happen. The rest of the tuning here is about getting as close to 100% as we can.

Let's modify the above cputune configuration to also include a dedicated I/O thread:

<vcpu placement='static'>3</vcpu>
<iothreads>1</iothreads>
<!-- ... -->
<cputune>
  <emulatorpin cpuset="2"/>
  <vcpupin vcpu="0" cpuset="3"/>
  <vcpupin vcpu="1" cpuset="4"/>
  <vcpupin vcpu="2" cpuset="5"/>
  <vcpusched vcpus='0-2' scheduler='fifo' priority='1'/>
  <iothreadpin iothread='1' cpuset='0-1'/>
</cputune>
<!-- ... -->
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='threads' iothread='1'/>
  <!-- ... -->
</disk>

So, what is happening above? We are creating 1 iothread (for high-performance disks like NVMe you probably want to increase this), pinning it to either CPU 0 or 1, and assigning this thread to the raw disk block device. Contrary to the vCPUs or the emulation core, an I/O thread does not seem very sensitive to latency, so it doesn't need a dedicated CPU (i.e.: there is no need to use isolcpus or cpuset here).
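
If you want to confirm that the I/O thread exists and got the expected affinity, libvirt can report it (again a sketch, with "win10" as a placeholder domain name):

$ virsh iothreadinfo win10   # lists each IOThread ID and its CPU affinity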

Another useful configuration is to make libvirt use Hugepages. By default the kernel can already use hugepages without any configuration thanks to Transparent Hugepages, but this is not deterministic, so you may want to configure it manually. To do this, first add the following to your XML config file:

<memoryBacking>
  <hugepages>
    <page size="2048" unit="KiB"/>
  </hugepages>
  <locked/>
  <nosharepages/>
</memoryBacking>

For 16 GiB (16777216 KiB) of RAM allocated to the guest, you will need 8192 hugepages with the default hugepage size of 2048 KiB (16777216 / 2048 = 8192). To allocate them:

$ sudo sysctl vm.nr_hugepages=8192
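
Runtime allocation can fail if memory is already too fragmented, so it is worth checking that the pages were actually reserved; the kernel exposes the counters in /proc/meminfo:

$ grep Huge /proc/meminfo   # HugePages_Total should now show 8192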

And to return to default:

$ sudo sysctl vm.nr_hugepages=0

Keep in mind that, theoretically, a bigger hugepage size may bring even more performance benefits. For example, you can use 1 GiB (the maximum value) instead. If you want to try it, you need to adjust the page size and the number of pages accordingly. However, the default may already be good enough for your needs.
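
As a sketch of what that adjustment could look like for 1 GiB pages and the same 16 GiB guest (assuming the CPU supports them, i.e. the pdpe1gb flag): the <page> element above becomes size="1048576" unit="KiB", and the 16 pages are usually reserved at boot, since runtime allocation of such large pages often fails due to fragmentation:

# Kernel command line (reserves 16 x 1 GiB pages at boot):
#   default_hugepagesz=1G hugepagesz=1G hugepages=16
# Or at runtime, if memory is not too fragmented:
$ sudo sh -c 'echo 16 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages'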

Next, it seems that the kernel's dirty page writeback uses a kthread that is not isolated by cpuset (it shouldn't be a problem with isolcpus, though). Each time it runs on a core that should be dedicated to a vCPU, it may generate a latency spike. To work around this we can run:

$ sudo sh -c 'echo 0 > /sys/bus/workqueue/devices/writeback/cpumask'

This will limit the kernel's dirty page writeback mechanism to the first CPU only (we could use all non-VM-dedicated cores instead, but the above works and is easier to set up). To go back to using all cores:

$ sudo sh -c 'echo fff > /sys/bus/workqueue/devices/writeback/cpumask'
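
For reference, the file takes a hexadecimal CPU bitmap, so if you do want to allow writeback on both host cores (CPUs 0 and 1 in this example), the mask would be binary 000011 = 0x3 (a sketch; adjust the mask to your own CPU layout):

$ sudo sh -c 'echo 3 > /sys/bus/workqueue/devices/writeback/cpumask'
# The current mask can be read back with:
$ cat /sys/bus/workqueue/devices/writeback/cpumask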

Also, there are many things running in the kernel that can affect latency. One of the most impactful is vm.stat_interval. While the VM is running we can temporarily increase this value:

$ sudo sysctl vm.stat_interval=120

And set it to default afterwards:

$ sudo sysctl vm.stat_interval=1
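
Since these host-side tweaks have to be applied every time the VM starts and reverted when it stops, a libvirt hook script can automate them. The following is only a sketch assuming the CPU layout used throughout this document and a guest named win10 (both placeholders); whether cset works when called from the hook also depends on your distribution's cgroup setup. It goes in /etc/libvirt/hooks/qemu and must be executable:

#!/bin/sh
# /etc/libvirt/hooks/qemu -- libvirtd calls this as: qemu <guest> <operation> <sub-operation>
GUEST="$1"
OPERATION="$2"

# Only act on our low-latency guest ("win10" is a placeholder name).
[ "$GUEST" = "win10" ] || exit 0

case "$OPERATION" in
  prepare)
    # Reserve the hugepages and shield the dedicated cores before the guest starts.
    sysctl vm.nr_hugepages=8192
    cset shield --cpu 2-5 --kthread on
    # Keep dirty page writeback off the dedicated cores.
    echo 0 > /sys/bus/workqueue/devices/writeback/cpumask
    # Reduce periodic vmstat work while the VM is running.
    sysctl vm.stat_interval=120
    ;;
  release)
    # Revert everything after the guest has shut down.
    sysctl vm.stat_interval=1
    echo fff > /sys/bus/workqueue/devices/writeback/cpumask
    cset shield --reset
    sysctl vm.nr_hugepages=0
    ;;
esac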

Other things that are not related to KVM/QEMU may also affect latency, for example MSR interrupts.

If you're desperate to get the lowest latency possible, a linux-rt kernel may help (or at least a kernel compiled with real-time flags). I didn't need to do this to get reasonable latency (~1 ms DPC latency), so I can't give any more tips here.

Remember: always benchmark!!!

Do not blindly trust the tips above or any other tips you find on the internet. It is very important to verify that your changes are having an effect by running tools like DPC Latency Checker or LatencyMon; otherwise you can end up with worse performance than before. For example, a badly configured fifo scheduler can perform worse than no tuning at all (there is a good reason the defaults are what they are).

Leave the tools above running for some minutes and also try to exercise the I/O (by downloading something or using the disk, preferably both at the same time). Your target is to stay below 1 ms (or 1000 µs) most of the time. Playing some action games and making sure there are no hiccups is another way to verify that everything is all right.
