jlevon/gist:6b5c6e9f19345d1617debd12f928afb4

## gistfile1.txt
We'll have two main PCIDs: 0 (kernel) and 1 (user).

Brief summary of PCID
---------------------

PCID is enabled by %cr4.PCIDE. It's available in at least Sandy Bridge, but INVPCID came later: Haswell I think.

PCID lives in the low 12 bits of %cr3 (bits 11:0, i.e. the page-offset bits that MMU_PAGEMASK masks off). A zero PCID is also used when %cr4.PCIDE is 0.

With %cr4.PCIDE set, TLB entries are tagged with the current PCID (the one in %cr3), and only TLB entries matching the current PCID are used.
(Global TLB entries are apparently not tagged, so they are used regardless.)

The behaviour of mov to %cr3 changes when %cr4.PCIDE is 1. If the top bit (bit 63) of the source operand is 0, the PCID named in the *source* operand is invalidated - NOT the current %cr3 PCID. Other PCIDs are not invalidated. Thus, a mov to %cr3 will NOT invalidate any "current" mappings, unless the new PCID is the same as the current one.

If the top bit is 1, nothing is invalidated.
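
To make the two cases above concrete, here is a minimal user-space sketch that only models the %cr3 value layout and which PCID a load would invalidate; it does not touch the real register. The names `CR3_PCID_MASK`, `CR3_PA_MASK`, `CR3_NOINVL`, `make_cr3()` and `cr3_load_flushes()` are made up for illustration and are not existing kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

#define	CR3_PCID_MASK	0xfffULL		/* bits 11:0 hold the PCID */
#define	CR3_PA_MASK	0x000ffffffffff000ULL	/* bits 51:12: top-level table PA */
#define	CR3_NOINVL	(1ULL << 63)		/* "top bit": skip invalidation */

/* Compose a %cr3 value from a page-aligned physical address and a PCID. */
static uint64_t
make_cr3(uint64_t top_pa, uint16_t pcid, int noinvl)
{
	uint64_t cr3;

	assert((top_pa & ~CR3_PA_MASK) == 0);
	assert(pcid <= CR3_PCID_MASK);

	cr3 = top_pa | pcid;
	return (noinvl ? (cr3 | CR3_NOINVL) : cr3);
}

/*
 * Model of a mov to %cr3 with PCIDE set: returns the PCID whose
 * (non-global) entries would be invalidated, or -1 if none.
 * Note it depends only on the value being loaded, not on the old %cr3.
 */
static int
cr3_load_flushes(uint64_t new_cr3)
{
	if (new_cr3 & CR3_NOINVL)
		return (-1);
	return ((int)(new_cr3 & CR3_PCID_MASK));
}
```

For example, loading `make_cr3(pa, 1, 0)` while running on PCID 0 would flush PCID 1 and leave the PCID 0 entries alone, whereas `make_cr3(pa, 0, 1)` flushes nothing.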

INVPCID has 4 forms, similar to INVVPID. You can shoot down an individual mapping within a PCID, a whole PCID, all PCIDs except global entries, and everything including globals (like a - supposedly faster - twiddle of %cr4.PGE).
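
As a rough sketch of the four forms: the type values 0-3 and the 128-bit descriptor layout (PCID in bits 11:0, linear address in the upper 64 bits) follow the Intel SDM, but the identifiers and the `invpcid_mkdesc()` helper are illustrative assumptions. The instruction itself is privileged, so this only builds the operand and does not execute INVPCID.

```c
#include <assert.h>
#include <stdint.h>

/* INVPCID type operand values. */
enum invpcid_type {
	INVPCID_ADDR = 0,		/* one linear address within one PCID */
	INVPCID_ID = 1,			/* one whole PCID, except globals */
	INVPCID_ALL_GLOBAL = 2,		/* all PCIDs, including globals */
	INVPCID_ALL_NONGLOBAL = 3	/* all PCIDs, except globals */
};

/* In-memory descriptor operand: 16 bytes. */
struct invpcid_desc {
	uint64_t ipd_pcid;	/* bits 11:0 = PCID, the rest must be zero */
	uint64_t ipd_addr;	/* linear address, used by INVPCID_ADDR only */
};

/* Build a descriptor, filling in only the fields the given type uses. */
static struct invpcid_desc
invpcid_mkdesc(enum invpcid_type type, uint16_t pcid, uint64_t va)
{
	struct invpcid_desc d = { 0, 0 };

	assert(pcid < 4096);

	if (type == INVPCID_ADDR || type == INVPCID_ID)
		d.ipd_pcid = pcid;
	if (type == INVPCID_ADDR)
		d.ipd_addr = va;
	return (d);
}
```

The two all-PCID forms ignore the descriptor contents, which is why the helper leaves them zeroed.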

How we're planning to use it
----------------------------

With KPTI, we can no longer use PT_GLOBAL to keep our kernel TLB entries around, since when we return to userspace, we cannot allow it to speculate across KERNELBASE. In particular, on user->kernel and kernel->user, we have to switch %cr3, and hence dump the entire TLB.

We want to use PCID to mitigate this somewhat: namely, a switch into - and out of - the kernel will at least keep our userspace mappings in the TLB (presuming we haven't context switched to another process, of course).

To do that we'll have PCID0 for the kernel, and PCID1 for userspace, ensuring that we don't flush PCID1 unless it's necessary.

FIXME: tlb shootdown/inval
FIXME: hat_switch
FIXME: lack of INVPCID

 - pcide changes PAT behaviour; problem? FIXME

cr4 audit
---------

pat_sync() cr4.pge twiddle to flush all
flush_all_tlb_entries() cr4.pge twiddle

mmu_tlbflush_entry audit
-------------------------

flush a single page via invlpg

kboot_mmu - presume this is all pre-PCIDE. Should validate.

i86pc/vm/hat_kdi.c: used prior to kernel (bootload kmdb). Therefore
can't presume PCID. Always a kernel-only mapping, but maybe setup prior
to PCID (hat_kdi_init). Always flushed after use though, so this seems
OK.

x86pte_mapin - only if !kpm. ASSERT()?
x86pte_set - local flush. if > KERNELBASE flush kpcid, else upcid
	INVPCID
x86pte_copy - only !kpm

hati_demap_func - shootdown handler. Can be a range. In this case we
need to make sure to flush depending on kernelbase/userlimit
	INVPCID

If we're flushing everything, we will need to do a cr4.pge twiddle or an
invpcid.
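
The "flush kpcid or upcid depending on the address" rule for x86pte_set and hati_demap_func could look roughly like the sketch below. Everything here is hypothetical: `SKETCH_KERNELBASE` is a placeholder and not the real KERNELBASE value, the PCID numbers assume the 0 (kernel) / 1 (user) split, and the comparison uses >= where the note above says >.

```c
#include <assert.h>
#include <stdint.h>

#define	PCID_KERNEL		0
#define	PCID_USER		1
#define	SKETCH_KERNELBASE	0xffff800000000000ULL	/* placeholder value */

/* Which PCID tags the TLB entry for this virtual address? */
static int
pcid_for_va(uint64_t va)
{
	return (va >= SKETCH_KERNELBASE ? PCID_KERNEL : PCID_USER);
}

/*
 * For a range shootdown, return a bitmask of the PCIDs that need
 * flushing: bit PCID_KERNEL and/or bit PCID_USER. Checking both
 * endpoints is enough, since there is only one boundary to cross.
 */
static unsigned int
pcids_for_range(uint64_t va, uint64_t len)
{
	uint64_t last = va + len - 1;

	assert(len > 0);
	assert(last >= va);	/* no wraparound */

	return ((1U << pcid_for_va(va)) | (1U << pcid_for_va(last)));
}
```

A caller would then issue one invalidation per set bit, or fall back to a full flush if both bits are set and the range is large.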

hat_flush_range - panic dump thing. only ever done on kas. But does this
imply we should always have kpcid == 0? Seems like it would make invals
easier, we never get confused between safe cr3 kpcid and ours.

hat_mempte_release/remap - for ppcopy(). Only flushed at cpu unconfigure time.
Strictly kas. !kpm only.

flush_all_tlb_entries audit
---------------------------

TLB_INVAL_ALL on tlb_service - way out of idle

Fallback for hat_flush_range.

cr3
---

fb_swtch_src - OK, since we twiddle pge first

kdi_idthdl.s: loads safe cr3. If we don't change, this'll keep user TLB
entries around (but flush kernel since that's our source operand pcid).
Presumably that's OK. Normal trap path will go through usual exit
routine.

kdi_master_entry - switches over to kernel cr3. Same as before.

set_pteval - does reload_cr3(), but maybe only 32-bit? Also, not used
for !xpv, except via hat_kern_alloc(); maybe early so OK.

hat_kern_setup - sets tss_cr3 to whatever getcr3() is at this point,
presumably early kas cr3. Again, fine if using PCID0 for kernel.
FIXME: not set for !boot CPUs?

tss_cr3 - presumably not relevant

kpti_safe_cr3 - should be 0 PCID.

kdi_flush_caches, a reload of cr3 used by KDI slaves.
kmdb_dpi_flush_slave_caches. Used for kmdb phys read/write. The meaning of
this is obscure. It's called *prior* to the read/write. But why?

i86pc/os/mp_pc.c: use of MAKECR3 here should be replaced with something
cleaner. Again would require PCID0.


kpti in, non-paranoid, from userspace: we need to do a non-flushing mov cr3 to pick up the
kernel cr3, leave userspace mappings in place.

kpti in, non-paranoid, from kernelspace: non-flushing mov cr3

kpti in, paranoid: should be same as above cases

kpti out to kernelspace: we don't modify cr3 here

kpti out to userspace: do a normal mov cr3: this should flush the (current)
kernelspace mappings iff PCID0. Perhaps assert PCID0 to verify??

hat_switch():
from user to kas
	FIXME: who deleted the VLP entries in this case? Does it matter?
	Are we lazy here?

	- we moved off the user cr3 already, but kept the mappings
	  around. We can do a non-flushing mov cr3 (it's HAT_CR3-kernel to
	  kas cr3 reload). Is this bad that we'd keep old userspace
	  mappings? Probably? Should we invpcid(user)?

	It seems like, since we remove ourselves from ->hat_cpus, that
	we must be eager here, otherwise hati_demap_func won't know to flush us.

from user to (different) user
	hat_vlp_update will trigger.
	cr3 is still kernel though. Need explicit invpcid(user).

from kas to user

	hat_vlp_update will trigger. Do we need an invalidate for moral
	equivalent of link_ptp()? Possibly; check intel manual.

	If we are lazy above (user->kas) then we must invpcid(user)
	first.


link_ptp(): will tlb shootdown if HAT_VLP. Should be sufficient for us
given above (i.e. hati_demap_func/hat_vlp_update). Really, seems like
this is mostly about copying over the new VLP entries, although it's
possible a replacement of pte may need a flush?

unlink_ptp(): if VLP, will DEMAP_ALL_ADDR.

hat_tlb_inval_range - HAT_SHARED behaviour??