KVM MMU Note

TL;DR

This note first lists the KVM data structures related to the mmu and then describes what happens to them at each step of creating and running a virtual machine. The goal of the note is to help implement EPT prefilling. The plan is to reuse the page fault handler to cause pre-faults for the pages in the memory dump. For prototyping, I plan to start with hardcoded values instead of changing the KVM API. I expect that prefilling the EPT will require userspace to pass in the list of guest physical addresses that should be pre-faulted.

Why prefill EPT?

Data Structures

  1. mmu page header struct kvm_mmu_page
struct kvm_mmu_page {
	struct list_head link;
	struct hlist_node hash_link;
	gfn_t gfn;
	union kvm_mmu_page_role role;
	u64 *spt;
	gfn_t *gfns;
	bool unsync;
	int root_count;
	unsigned int unsync_children;
	struct kvm_rmap_head parent_ptes;
	unsigned long mmu_valid_gen;
	DECLARE_BITMAP(unsync_child_bitmap, 512);
#ifdef CONFIG_X86_32
	int clear_spte_count;
#endif
	atomic_t write_flooding_count;
};
struct kvm_rmap_head {
	unsigned long val;
};
union kvm_mmu_page_role {
	unsigned word;
	struct {
		unsigned level:4;
		unsigned cr4_pae:1;
		unsigned quadrant:2;
		unsigned direct:1;
		unsigned access:3;
		unsigned invalid:1;
		unsigned nxe:1;
		unsigned cr0_wp:1;
		unsigned smep_andnot_wp:1;
		unsigned smap_andnot_wp:1;
		unsigned ad_disabled:1;
		unsigned :7;
		unsigned smm:8;
	};
};
  • field link is used by KVM to maintain a list of active mmu pages
  • field hash_link is used by KVM to maintain a hash table of valid mmu pages
  • fields gfn and role together form the key into the hash table above
  • field spt points to the page that holds the shadow (EPT) page table entries
  • field gfns is not used when EPT is enabled
  2. per vcpu mmu context
struct kvm_mmu {
        void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long root);
        unsigned long (*get_cr3)(struct kvm_vcpu *vcpu);
        u64 (*get_pdptr)(struct kvm_vcpu *vcpu, int index);
        int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err,
                          bool prefault);
        void (*inject_page_fault)(struct kvm_vcpu *vcpu,
                                  struct x86_exception *fault);
        gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva, u32 access,
                            struct x86_exception *exception);
        gpa_t (*translate_gpa)(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access,
                               struct x86_exception *exception);
        int (*sync_page)(struct kvm_vcpu *vcpu,
                         struct kvm_mmu_page *sp);
        void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva);
        void (*update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
                           u64 *spte, const void *pte);
        hpa_t root_hpa;
        union kvm_mmu_page_role base_role;
        u8 root_level;
        u8 shadow_root_level;
        u8 ept_ad;
        bool direct_map;

        /*
         * Bitmap; bit set = permission fault
         * Byte index: page fault error code [4:1]
         * Bit index: pte permissions in ACC_* format
         */
        u8 permissions[16];

        /*
        * The pkru_mask indicates if protection key checks are needed.  It
        * consists of 16 domains indexed by page fault error code bits [4:1],
        * with PFEC.RSVD replaced by ACC_USER_MASK from the page tables.
        * Each domain has 2 bits which are ANDed with AD and WD from PKRU.
        */
        u32 pkru_mask;

        u64 *pae_root;
        u64 *lm_root;

        /*
         * check zero bits on shadow page table entries, these
         * bits include not only hardware reserved bits but also
         * the bits spte never used.
         */
        struct rsvd_bits_validate shadow_zero_check;

        struct rsvd_bits_validate guest_rsvd_check;

        /* Can have large pages at levels 2..last_nonleaf_level-1. */
        u8 last_nonleaf_level;

        bool nx;

        u64 pdptrs[4]; /* pae */
};
  3. The fields related to virtual machine memory management in the per virtual machine KVM context struct kvm
struct kvm {
	spinlock_t mmu_lock;
	struct mutex slots_lock;
	struct mm_struct *mm;
	struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
	// ... omitted unrelated fields 
	struct kvm_arch arch;
	// ... omitted unrelated fields 
};
// X86
struct kvm_arch {
	unsigned int n_used_mmu_pages;
	unsigned int n_requested_mmu_pages;
	unsigned int n_max_mmu_pages;
	unsigned int indirect_shadow_pages;
	unsigned long mmu_valid_gen;
	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
	struct list_head active_mmu_pages;
	// ... omitted unrelated fields
};
struct kvm_memslots {
	u64 generation;
	struct kvm_memory_slot memslots[KVM_MEM_SLOTS_NUM];
	short id_to_index[KVM_MEM_SLOTS_NUM];
	atomic_t lru_slots;
	int used_slots;
};
  • field mm points to the user address space (struct mm_struct) tied to the virtual machine
  • field memslots points to the information about the memory slots the virtual machine has
  4. memory slots
struct kvm_memory_slot {
	gfn_t base_gfn;
	unsigned long npages;
	unsigned long *dirty_bitmap;
	struct kvm_arch_memory_slot arch;
	unsigned long userspace_addr;
	u32 flags;
	short id;
};
// X86
struct kvm_arch_memory_slot {
	struct kvm_rmap_head *rmap[KVM_NR_PAGE_SIZES];
	struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
	unsigned short *gfn_track[KVM_PAGE_TRACK_MAX];
};
struct kvm_lpage_info {
	int disallow_lpage;
};
  • field base_gfn is the starting guest physical page number of the memory slot
  • field userspace_addr is the starting userspace address of the memory slot. The userspace is the address space of the process within which the virtual machine is created and booted.
  • field lpage_info indicates, for each huge page size, whether that size is disallowed
  • field rmap is a two-dimensional array; for each page size it stores a reverse map from guest physical page number to its parent pte(s)
  5. reverse map
struct kvm_rmap_head {
	unsigned long val;
};
#define PTE_LIST_EXT 3
struct pte_list_desc {
	u64 *sptes[PTE_LIST_EXT];
	struct pte_list_desc *more;
};

Traversal of the reverse map must start with rmap_get_first and continue with rmap_get_next. struct kvm_rmap_head is encoded as follows:

If the bit zero of rmap_head->val is clear, then it points to the only spte in this rmap chain. Otherwise, (rmap_head->val & ~1) points to a struct pte_list_desc containing more mappings.
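
As an illustration, here is a minimal sketch of how one rmap chain can be walked given this encoding. This is not the kernel's actual iterator (that is rmap_get_first/rmap_get_next in arch/x86/kvm/mmu.c); walk_rmap and visit are hypothetical names.

/*
 * Illustrative sketch only: walk one rmap chain following the encoding
 * described above.
 */
static void walk_rmap(struct kvm_rmap_head *rmap_head,
		      void (*visit)(u64 *sptep))
{
	struct pte_list_desc *desc;
	int i;

	if (!rmap_head->val)
		return;				/* empty chain */

	if (!(rmap_head->val & 1)) {
		/* bit 0 clear: val itself is the single spte pointer */
		visit((u64 *)rmap_head->val);
		return;
	}

	/* bit 0 set: (val & ~1) points to a pte_list_desc chain */
	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
	while (desc) {
		for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; i++)
			visit(desc->sptes[i]);
		desc = desc->more;
	}
}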

  6. mmu memory caches
#define KVM_NR_MEM_OBJS 40
struct kvm_mmu_memory_cache {
	int nobjs;
	void *objects[KVM_NR_MEM_OBJS];
};

The mmu module maintains three memory caches:

  • mmu_page_header_cache
  • pte_list_desc_cache
  • mmu_page_cache

The first two are backed by SLAB caches created during mmu module initialization; the last is filled directly via __get_free_page.
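
These caches are topped up before the fault path takes mmu_lock, so allocations inside the lock never sleep; taking an object is then just a pop. Roughly (simplified, details vary by kernel version):

/* Simplified sketch: pop one pre-allocated object from an mmu memory cache. */
static void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
{
	BUG_ON(!mc->nobjs);
	return mc->objects[--mc->nobjs];
}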

Lifetime of the Data Structures Above

Load KVM

KVM consists of two loadable modules, namely kvm_intel and kvm. The former is platform dependent and depends on the latter. During the module initialization, the aforementioned SLAB caches are created.
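Roughly, the cache creation in kvm_mmu_module_init (arch/x86/kvm/mmu.c) looks like the following; this is a simplified excerpt with slab flags omitted, and details vary by kernel version:

pte_list_desc_cache = kmem_cache_create("pte_list_desc",
					sizeof(struct pte_list_desc),
					0, 0, NULL);
mmu_page_header_cache = kmem_cache_create("kvm_mmu_page_header",
					  sizeof(struct kvm_mmu_page),
					  0, 0, NULL);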

Create a Virtual Machine

The user creates a virtual machine through the KVM_CREATE_VM ioctl. First, a struct kvm is allocated for the new VM. Second, kvm->mm is set to the current process's struct mm_struct. Third, kvm_arch_init_vm is called to initialize the architecture-specific field kvm->arch. Inside kvm_arch_init_vm, the active_mmu_pages list is initialized and kvm_mmu_init_vm is called, which sets up page tracking (I don't think this is of our concern, at least for now).

Create a Memory Slot

The user registers a memory slot with KVM through KVM_SET_USER_MEMORY_REGION ioctl on the VM file descriptor.

During this ioctl, lpage_info is allocated through kvzalloc. The struct is initialized by checking, for each huge page size, whether the user memory region is aligned to that size. If the region is not aligned, KVM disallows that page size for the slot. In Linux, the only way to get file-backed, huge-page-aligned memory is hugetlbfs: mmap a file in hugetlbfs with the flag MAP_HUGETLB. One can back a memory region with a file in /dev/shm mounted to always use huge pages, but then the flag MAP_HUGETLB, which forces mmap to do huge page alignment, cannot be used: internally, Linux requires the file system to be hugetlbfs when MAP_HUGETLB is set and MAP_ANONYMOUS is not.
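A minimal userspace sketch of getting huge-page-aligned, file-backed guest memory from hugetlbfs; the mount point and size are hypothetical:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define GUEST_MEM_SIZE (1UL << 30)	/* 1 GiB; must be a multiple of the huge page size */

int main(void)
{
	/* Hypothetical path; requires a mounted hugetlbfs instance. */
	int fd = open("/mnt/hugetlbfs/guest-mem", O_CREAT | O_RDWR, 0600);
	void *mem;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ftruncate(fd, GUEST_MEM_SIZE) < 0) {
		perror("ftruncate");
		return 1;
	}
	/* MAP_HUGETLB is accepted because the file lives on hugetlbfs;
	 * the returned address is huge-page aligned. */
	mem = mmap(NULL, GUEST_MEM_SIZE, PROT_READ | PROT_WRITE,
		   MAP_SHARED | MAP_HUGETLB, fd, 0);
	if (mem == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* mem would then be passed as userspace_addr in
	 * KVM_SET_USER_MEMORY_REGION, so lpage_info allows huge pages. */
	return 0;
}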

Function kvm_arch_create_memslot initializes struct kvm_arch_memory_slot. It allocates memory for the reverse map rmap for each page size (4K, 2M, 1G) through kvzalloc. Each guest page has an entry in the reverse map. gfn - base_gfn is the index into the reverse map.
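
For example, finding the rmap chain head for a 4K guest page in a slot amounts to the following; this is a simplified sketch of what the kernel's __gfn_to_rmap helper does, ignoring the 2M and 1G levels, and gfn_to_rmap_4k is a made-up name:

static struct kvm_rmap_head *gfn_to_rmap_4k(gfn_t gfn,
					    struct kvm_memory_slot *slot)
{
	unsigned long idx = gfn - slot->base_gfn;

	return &slot->arch.rmap[0][idx];	/* rmap[0] is the 4K level */
}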

Create a VCPU

The user creates a vcpu through the KVM_CREATE_VCPU ioctl on the VM file descriptor. The creation process calls kvm_mmu_create followed by kvm_mmu_setup. With EPT enabled, kvm_mmu_setup does the real initialization and kvm_mmu_create is largely irrelevant. At this step, mmu.root_hpa is set to INVALID_PAGE and the page fault handler is set to tdp_page_fault.
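
The relevant part of the TDP setup is sketched below; this is a simplified excerpt of init_kvm_tdp_mmu, reached from kvm_mmu_setup when EPT/TDP is enabled, with most callbacks and paging-mode setup omitted:

static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
{
	struct kvm_mmu *context = &vcpu->arch.mmu;

	context->page_fault = tdp_page_fault;	/* EPT violations land here */
	context->root_hpa = INVALID_PAGE;	/* no EPT root allocated yet */
	context->direct_map = true;		/* gpa mapped directly, no shadowing of guest page tables */
	/* ... remaining callbacks, root_level, shadow_root_level, etc. omitted ... */
}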

Enter the Guest Virtual Machine

The user starts the virtual machine through the KVM_RUN ioctl. Internally, the ioctl reaches the function vcpu_enter_guest, which calls kvm_mmu_reload before switching into the guest. This is where a root page is allocated if mmu.root_hpa equals INVALID_PAGE, which means the first KVM_RUN ioctl always causes the root page of the mmu to be allocated. At this point, however, the root page contains only zeros.
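
kvm_mmu_reload itself is tiny; roughly (from arch/x86/kvm/mmu.h, possibly differing slightly between kernel versions):

static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
{
	/* Root already present: nothing to do on this entry. */
	if (likely(vcpu->arch.mmu.root_hpa != INVALID_PAGE))
		return 0;

	/* First entry (or root was freed): allocate the root mmu page. */
	return kvm_mmu_load(vcpu);
}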

Handle a Page Fault

Every time there is a missing page table entry or a write to a write-protected page, a page fault is generated and the function tdp_page_fault is called. The handler updates page table entries and allocates mmu pages as needed.

In the case of the first entry into the guest, as soon as the guest needs to translate a virtual address, a page fault is generated by the mmu because the mmu's root page is empty. Assume four-level paging and a 4K page size (as in our case). To handle the fault, the KVM mmu allocates a level 3 mmu page, a level 2 mmu page, and then a level 1 mmu page, updating page table entries accordingly along the path. Level 1 is the mmu leaf level and its pages contain entries pointing to guest memory pages.
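
A heavily simplified sketch of this walk follows. The real code is __direct_map in arch/x86/kvm/mmu.c; huge pages, error handling, and spte flag computation are all omitted, and direct_map_sketch is a made-up name:

static void direct_map_sketch(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
{
	struct kvm_shadow_walk_iterator it;
	struct kvm_mmu_page *sp;

	for_each_shadow_entry(vcpu, (u64)gfn << PAGE_SHIFT, it) {
		if (it.level == PT_PAGE_TABLE_LEVEL) {
			/* Level 1 (leaf): install the spte pointing at the
			 * host page (pfn) backing this gfn. */
			mmu_set_spte(vcpu, it.sptep, ACC_ALL, 0, it.level,
				     gfn, pfn, false, true);
			break;
		}
		if (!is_shadow_present_pte(*it.sptep)) {
			/* Missing level 3/2 entry: allocate an mmu page and
			 * link it into the parent entry. (The real code
			 * derives a pseudo gfn for it from it.addr.) */
			sp = kvm_mmu_get_page(vcpu, gfn, it.addr,
					      it.level - 1, 1, ACC_ALL);
			link_shadow_page(vcpu, it.sptep, sp);
		}
	}
}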

Updating page table entries and allocating new mmu pages also update the VM's active_mmu_pages list, mmu_page_hash, and the reverse map rmap.

A Possible Way to Prefill EPT

To prefill the EPT means to preallocate mmu pages and fill them with the correct page table entries. Prefilling should happen after memory slot registration and before the first entry into the guest.

Since KVM's management of mmu pages and page table entries is complex, a good first step is to reuse the page fault handler; that is, to explicitly cause pre-faults for the pages in the memory dump.

// signature of tdp_page_fault
static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code, bool prefault);

To cause pre-faults, the caller of tdp_page_fault must supply guest physical addresses. Therefore, there should be a way to pass the list of guest physical addresses from userspace into KVM (maybe traversing the mm_struct backing the VM's memory slots could give us the list? But I suppose that traversal takes time). For the purpose of a prototype, I plan to first hardcode some values, for example the first ten guest physical addresses that cause a page fault.

If we piggyback EPT prefilling on vcpu creation, it will require changing the KVM_CREATE_VCPU ioctl. The alternative is to add another VM ioctl, KVM_PREFILL_MMU, which takes two parameters: the array of guest physical addresses and the array size.

Changes

  1. API change: KVM_CREATE_VCPU ioctl accepts struct kvm_vcpu_config as the argument:
#define KVM_CREATE_VCPU _IO(KVMIO, 0x41)

modified to be

#define KVM_CREATE_VCPU _IOW(KVMIO, 0x41, struct kvm_vcpu_config)

The definition of struct kvm_vcpu_config:

// in include/uapi/linux/kvm.h
struct kvm_vcpu_config {
	__u32 id; // the old argument
	__u32 ngpas; // the number of guest physical addresses
	__u64 gpas[0]; // the array of guest physical addresses
};
  2. Internal function changes
  • kvm_vm_ioctl_create_vcpu in virt/kvm/kvm_main.c
static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id);

modified to be

static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, struct kvm_vcpu_config *kvm_vcpu_config,
                                                     u64 *gpas);

The VM ioctl entry function kvm_vm_ioctl is modified accordingly to supply this function with a kernel-space pointer to struct kvm_vcpu_config and the array of guest physical addresses.
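
One possible way for kvm_vm_ioctl to bring the variable-length argument into kernel space is sketched below. This follows the proposed (not existing) API above; bound checking of ngpas is noted but not implemented:

	case KVM_CREATE_VCPU: {
		struct kvm_vcpu_config hdr;
		struct kvm_vcpu_config *cfg;

		r = -EFAULT;
		/* Copy the fixed-size header first to learn ngpas ... */
		if (copy_from_user(&hdr, argp, sizeof(hdr)))
			goto out;
		/* ... then header plus the flexible gpas[] array in one go.
		 * (A real implementation must bound-check hdr.ngpas.) */
		cfg = memdup_user(argp,
				  sizeof(hdr) + hdr.ngpas * sizeof(__u64));
		if (IS_ERR(cfg)) {
			r = PTR_ERR(cfg);
			goto out;
		}
		r = kvm_vm_ioctl_create_vcpu(kvm, cfg, cfg->gpas);
		kfree(cfg);
		break;
	}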

  • kvm_arch_vcpu_setup in arch/x86/kvm/x86.c
int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu) {
	//...
	kvm_mmu_setup(vcpu);
	//...
}

modified to be

int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu, u32 ngpas, u64 *gpas) {
	//...
	kvm_mmu_setup(vcpu);
	r = kvm_mmu_load(vcpu);
	if (r)
		goto out;
	kvm_mmu_prefill(vcpu, ngpas, gpas);
	//...
}
  3. Add kvm_mmu_prefill in arch/x86/kvm/mmu.c
void kvm_mmu_prefill(struct kvm_vcpu *vcpu, u32 ngpas, gva_t *gpas);

The function is called by kvm_arch_vcpu_setup.
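
A minimal sketch of the body, reusing the TDP fault handler as discussed above; error handling and the prefault semantics are kept deliberately trivial:

/*
 * Sketch only: pre-fault each guest physical address through the normal
 * page fault path so its EPT entries exist before the first VM entry.
 * Assumes the mmu root has already been allocated via kvm_mmu_load().
 */
void kvm_mmu_prefill(struct kvm_vcpu *vcpu, u32 ngpas, gva_t *gpas)
{
	u32 i;

	for (i = 0; i < ngpas; i++) {
		/* error_code 0: treat it as a plain not-present fault;
		 * prefault == true, as in the async page fault path. */
		vcpu->arch.mmu.page_fault(vcpu, gpas[i], 0, true);
	}
}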
