jbush001/large_tlb.md

## large_tlb.md

      
## Problem Description

The current Nyuzi TLB implementation caches virtual to physical translations at page granularity, with a fixed page size of 4k. However, the performance of programs that touch a large area of memory can be limited by the overhead of handling TLB misses. One way to mitigate this is to allow mapping pages that are larger than 4k, usually a power-of-two multiple like 4MB. Example use cases include mapping a physical memory alias into the kernel and mapping a graphics framebuffer (which is often contiguous).

## Implementation

Although the TLB implementation is software managed and thus technically could use any encoding for page translations,
the design assumes and is optimized for a two level table hierarchy: the page directory is a 4k page with 1024 entries,
each of which points to a 4k page table with 1024 entries, each of which maps one 4k page.
Page Dir -->  +--------+
              |        |
              +--------+
              ~~~~~~~~~~
              ~~~~~~~~~~
              +--------+                    +--------+
              |        | -- Page Table ---> |        |
              +--------+                    +--------+
                                            |        |  ----->  Page
                                            +--------+
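
As a rough illustration of the hierarchy above, a 32-bit virtual address can be split into a 10-bit directory index, a 10-bit table index, and a 12-bit page offset. This sketch assumes that split follows directly from 1024-entry tables and 4k pages; the function and constant names are hypothetical and are not taken from the Nyuzi sources.

```python
PAGE_SHIFT = 12                      # 4k pages -> 12 offset bits
INDEX_BITS = 10                      # 1024 entries per table -> 10 index bits
INDEX_MASK = (1 << INDEX_BITS) - 1

# One page directory entry spans 1024 * 4k = 4MB of virtual address space.
DIR_ENTRY_SPAN = 1 << (PAGE_SHIFT + INDEX_BITS)


def split_virtual_address(va):
    """Return (directory index, table index, page offset) for a 32-bit address."""
    dir_index = (va >> (PAGE_SHIFT + INDEX_BITS)) & INDEX_MASK
    table_index = (va >> PAGE_SHIFT) & INDEX_MASK
    offset = va & ((1 << PAGE_SHIFT) - 1)
    return dir_index, table_index, offset
```

Because each directory entry already covers exactly 4MB, a 4MB page can reuse the directory index as its virtual page number without changing the table layout.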
              

This proposal adds the ability for a page directory entry to point to a 4MB memory page instead of a
page table. A new flag in the entry indicates whether the page directory entry is a large page.
This is what a page directory entry looks like now:
+----------------------------------------+-----------------------+-+
|         page table address (20)        |      unused (11)      |P|
+----------------------------------------+-----------------------+-+

This proposal would add the 'L' flag and also make the G, S, X, and W flags, which page table entries already use,
relevant for page directory entries:
+----------------------------------------+-------------+-+-+-+-+-+-+
|         page table address (20)        | unused (6)  |L|G|S|X|W|P|
+----------------------------------------+-------------+-+-+-+-+-+-+

P - Present
W - Writable
X - Executable
S - Supervisor
G - Global
L - Large
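
The proposed layout can be sketched as a small encode/decode helper. The bit positions below are read off the diagram under the assumption that P occupies bit 0 and the flags ascend to the left (so L is bit 5); the helper names are hypothetical.

```python
# Assumed bit positions, taken from the diagram with P as the least significant bit.
FLAG_BITS = {"P": 0, "W": 1, "X": 2, "S": 3, "G": 4, "L": 5}
ADDR_MASK = 0xFFFFF000   # upper 20 bits hold the address


def make_pde(address, flags=""):
    """Pack a 4k aligned 32-bit address and a string of flag letters into an entry."""
    if address & ~ADDR_MASK:
        raise ValueError("address must be 4k aligned and fit in 32 bits")
    entry = address
    for letter in flags:
        entry |= 1 << FLAG_BITS[letter]
    return entry


def decode_pde(entry):
    """Return (address, flag letters) for a page directory entry."""
    address = entry & ADDR_MASK
    flags = "".join(l for l, bit in FLAG_BITS.items() if entry & (1 << bit))
    return address, flags
```

For example, `make_pde(0x08000000, "PWGL")` would describe a present, writable, global large page, while `make_pde(0x5000, "P")` would describe an ordinary pointer to a page table.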

Normally, the software TLB miss handler checks the present flag on the page directory entry, then dereferences
the pointer to read the page table. It then copies the page table entry directly into the TLB. The new TLB miss handler
would also check the L flag on the page directory entry. If it is set, the handler puts the page directory entry directly
into the TLB and skips the page table traversal.
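
The handler flow described above might look roughly like the following. This is a behavioral sketch in Python rather than the real handler: `read_page_table` and `tlb_insert` are hypothetical callbacks standing in for a memory read and the 'tlbinsert' instruction, the flag bit positions are assumed from the diagram, and raising `PageFault` for a non-present directory entry is a placeholder for whatever the kernel actually does.

```python
PDE_PRESENT = 1 << 0     # assumed bit position
PDE_LARGE = 1 << 5       # assumed bit position
ADDR_MASK = 0xFFFFF000


class PageFault(Exception):
    """Placeholder for the kernel's page fault path."""


def handle_tlb_miss(va, page_dir, read_page_table, tlb_insert):
    """Walk the two level table for va and insert one translation."""
    pde = page_dir[va >> 22]
    if not pde & PDE_PRESENT:
        raise PageFault(hex(va))
    if pde & PDE_LARGE:
        # Large page: the directory entry is the translation, so skip level two.
        tlb_insert(va & ~0x3FFFFF, pde)
        return
    # Small page: follow the pointer and copy the page table entry as is.
    table = read_page_table(pde & ADDR_MASK)
    pte = table[(va >> 12) & 0x3FF]
    tlb_insert(va & ~0xFFF, pte)
```

The large page path saves one memory read per miss, in addition to reducing how often misses occur.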
In hardware, the TLB is currently implemented as a set associative cache, which performs a lookup based
on the virtual page number. The new implementation will add a small, fully associative cache for large pages
that is looked up in parallel with it (this is all contained in tlb.v).
               Large Page Lookup
           +----+ +----+ +----+ +----+
     +---> |    | |    | |    | |    | -------+
     |     +----+ +----+ +----+ +----+        |
     |                                        |      
     |          Small Page Lookup             +-----> |\
     |      way0   way1   way2   way3                 | |_____\
     |     +----+ +----+ +----+ +----+                | |     /
va --+     |    | |    | |    | |    |        +-----> |/
     |     +----+ +----+ +----+ +----+        |
     |     +----+ +----+ +----+ +----+        |
     +-->  |    | |    | |    | |    |  ------+
           +----+ +----+ +----+ +----+ 
           +----+ +----+ +----+ +----+ 
           |    | |    | |    | |    |
           +----+ +----+ +----+ +----+ 
                      ...

The large page cache will contain a small number of entries (around 4-8). This keeps the design
simple and doesn't require a set associative cache. Since it seems unlikely that there will be a lot
of these pages simultaneously (as they cover a large area of memory), this seems sufficient.
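
To put the proposed size in perspective, the address range covered by the large page cache is simple to work out. The helper names below are hypothetical; the arithmetic uses only the 4k and 4MB page sizes and the 4-8 entry range from this proposal.

```python
SMALL_PAGE_BYTES = 4 * 1024
LARGE_PAGE_BYTES = 4 * 1024 * 1024


def reach_bytes(num_entries, page_bytes):
    """Address range covered when every entry holds a distinct page."""
    return num_entries * page_bytes


def equivalent_small_entries(num_large_entries):
    """Number of 4k TLB entries needed to cover the same range."""
    return reach_bytes(num_large_entries, LARGE_PAGE_BYTES) // SMALL_PAGE_BYTES


if __name__ == "__main__":
    for n in (4, 8):
        mb = reach_bytes(n, LARGE_PAGE_BYTES) // (1024 * 1024)
        print(f"{n} large entries cover {mb}MB, "
              f"as much as {equivalent_small_entries(n)} small entries")
```

Four entries cover 16MB and eight cover 32MB, which would otherwise take 4096 and 8192 small page entries respectively.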
When a 'tlbinsert' instruction is executed, hardware will choose which cache to insert it into based
on the 'L' flag. If it is set, it will insert the entry into the large page cache, otherwise into
the small page cache. The tlb invalidate instruction will apply to both the large and small page
caches.
It is possible in this configuration to have translations for a specific address in both the small
and large caches. In this case, the large cache will be used. It is the responsibility of software to
invalidate all small page translations that fall within its range before inserting a large page.
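
The insert, invalidate, and precedence rules above can be captured in a small behavioral model. It is only a sketch: the cache sizes, the round-robin replacement, the L flag bit position, and the choice to have an invalidate drop a large entry that covers the address are all assumptions, and the class is not derived from tlb.v.

```python
PDE_LARGE = 1 << 5       # assumed bit position of the L flag
SMALL_SHIFT = 12         # 4k pages
LARGE_SHIFT = 22         # 4MB pages


class TlbModel:
    """Behavioral model only; sizes and replacement policy are assumptions."""

    def __init__(self, num_large=4, num_sets=16, num_ways=4):
        self.large = [None] * num_large                       # fully associative
        self.sets = [[None] * num_ways for _ in range(num_sets)]
        self.large_victim = 0
        self.set_victim = [0] * num_sets

    def insert(self, va, entry):
        """Model of tlbinsert: the L flag selects the destination cache."""
        if entry & PDE_LARGE:
            self.large[self.large_victim] = (va >> LARGE_SHIFT, entry)
            self.large_victim = (self.large_victim + 1) % len(self.large)
        else:
            vpn = va >> SMALL_SHIFT
            index = vpn % len(self.sets)
            way = self.set_victim[index]
            self.sets[index][way] = (vpn, entry)
            self.set_victim[index] = (way + 1) % len(self.sets[index])

    def invalidate(self, va):
        """Drop any translation covering va from both caches."""
        large_vpn = va >> LARGE_SHIFT
        small_vpn = va >> SMALL_SHIFT
        for i, e in enumerate(self.large):
            if e is not None and e[0] == large_vpn:
                self.large[i] = None
        ways = self.sets[small_vpn % len(self.sets)]
        for i, e in enumerate(ways):
            if e is not None and e[0] == small_vpn:
                ways[i] = None

    def lookup(self, va):
        """Return (kind, entry) or None; a large page hit takes precedence."""
        for e in self.large:
            if e is not None and e[0] == va >> LARGE_SHIFT:
                return ("large", e[1])
        for e in self.sets[(va >> SMALL_SHIFT) % len(self.sets)]:
            if e is not None and e[0] == va >> SMALL_SHIFT:
                return ("small", e[1])
        return None
```

A model like this could also serve as a reference when checking the hardware's behavior for overlapping small and large entries.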
From the view of hardware modules outside the 'tlb' module, a large page should look the same as if
software had created 1024 small page entries and put them into the TLB.
Although the virtual address will necessarily be 4MB aligned, it is not clear that the physical
addresses need to be. Consider the case where two large pages have overlapping physical ranges.
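
One way to reason about both points is to expand a large page into its 1024 equivalent small mappings and to compare two ways of forming the physical address. This is a sketch with hypothetical function names: `translate_add` adds the offset to the physical base, while `translate_concat` replaces the low 22 bits, which is an assumption about how hardware might do it cheaply and is not taken from the Nyuzi sources.

```python
SMALL_PAGE = 1 << 12
LARGE_PAGE = 1 << 22
PAGES_PER_LARGE = LARGE_PAGE // SMALL_PAGE   # 1024


def expand_large_page(virt_base, phys_base):
    """Small page (va, pa) pairs equivalent to one large page mapping."""
    if virt_base % LARGE_PAGE:
        raise ValueError("virtual base must be 4MB aligned")
    return [(virt_base + i * SMALL_PAGE, phys_base + i * SMALL_PAGE)
            for i in range(PAGES_PER_LARGE)]


def translate_add(va, phys_base):
    """Physical address formed by adding the offset within the large page."""
    return phys_base + (va & (LARGE_PAGE - 1))


def translate_concat(va, phys_base):
    """Physical address formed by replacing the low 22 bits of the base."""
    return (phys_base & ~(LARGE_PAGE - 1)) | (va & (LARGE_PAGE - 1))
```

The two translations agree whenever the physical base is 4MB aligned. With a base such as 0x08001000 they diverge, so supporting unaligned physical bases would require the additive form.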

## Testing


### Functional

- Ensure large and small page collisions are handled properly, that is, when someone (incorrectly) inserts a TLB entry for a small page that is inside the address range of a large page. In this case, the large page should probably take precedence (although we could argue this is a bug in system software and the behavior is undefined).
- Try several virtual addresses within a large page.
- Ensure inserting a large page entry doesn't alter small page lookup and vice versa.
- All page flag tests probably need to be split into small and large page versions to ensure flags are read properly (might be able to test both cases in the same file to avoid too much duplication).


### Performance

Any test that currently uses a framebuffer seems like a useful candidate. For example, doom, rotozoom, or quakeview.