@jbush001
Last active Nov 16, 2017

Problem Description

The current Nyuzi TLB implementation caches virtual-to-physical translations at page granularity, with a fixed page size of 4k. For programs that touch a large area of memory, the overhead of handling TLB misses can limit performance. One way to mitigate this is to allow mapping pages larger than 4k, usually a power-of-two multiple such as 4MB. Example use cases include mapping a physical memory alias into the kernel and mapping a graphics framebuffer (which is often contiguous).

Implementation

Although the TLB is software managed and thus technically could use any encoding for page translations, the design assumes and is optimized for a two-level table hierarchy: a 4k page directory with 1024 entries, each of which points to a 4k page table with 1024 entries that each map one page.

Page Dir -->  +--------+
              |        |
              +--------+
              ~~~~~~~~~~
              ~~~~~~~~~~
              +--------+                    +--------+
              |        | -- Page Table ---> |        |
              +--------+                    +--------+
                                            |        |  ----->  Page
                                            +--------+
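
As a rough illustration of the hierarchy above, the following sketch splits a virtual address into its three fields. It assumes 32-bit virtual addresses (which 1024 x 1024 x 4k implies); the function and macro names are made up for this example and do not come from the Nyuzi sources.

```c
#include <stdint.h>

#define PAGE_SHIFT        12    /* 4k pages */
#define ENTRIES_PER_TABLE 1024  /* entries in a page directory or page table */

/* Bits 31:22 select one of the 1024 page directory entries. */
uint32_t page_dir_index(uint32_t va)
{
    return va >> 22;
}

/* Bits 21:12 select one of the 1024 entries in the page table. */
uint32_t page_table_index(uint32_t va)
{
    return (va >> PAGE_SHIFT) & (ENTRIES_PER_TABLE - 1);
}

/* Bits 11:0 are the byte offset within the 4k page. */
uint32_t page_offset(uint32_t va)
{
    return va & ((1u << PAGE_SHIFT) - 1);
}
```

Each page directory entry therefore covers 1024 x 4k = 4MB of virtual address space, which is why 4MB is the natural size for a large page in this scheme.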
              

This proposal will add the ability for a page directory entry to point to a 4MB memory page instead of a page table. It will add a new flag to the entry that indicates whether the page directory entry maps a large page. This is what a page directory entry looks like now:

+----------------------------------------+-----------------------+-+
|         page table address (20)        |      unused (11)      |P|
+----------------------------------------+-----------------------+-+

This proposal would add the 'L' flag and make the G, S, X, and W flags, which currently apply only to page table entries, meaningful in page directory entries as well:

+----------------------------------------+-------------+-+-+-+-+-+-+
|         page table address (20)        | unused (6)  |L|G|S|X|W|P|
+----------------------------------------+-------------+-+-+-+-+-+-+

P - Present
W - Writable
X - Executable
S - Supervisor
G - Global
L - Large
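
One possible encoding of the proposed entry is sketched below. The bit positions are read off the diagram (P in bit 0 up to L in bit 5) and are an assumption, as are all of the macro and function names.

```c
#include <stdint.h>

/* Flag bits, assuming the rightmost field in the diagram is bit 0. */
#define PDE_PRESENT    (1u << 0)  /* P */
#define PDE_WRITABLE   (1u << 1)  /* W */
#define PDE_EXECUTABLE (1u << 2)  /* X */
#define PDE_SUPERVISOR (1u << 3)  /* S */
#define PDE_GLOBAL     (1u << 4)  /* G */
#define PDE_LARGE      (1u << 5)  /* L */
#define PDE_FLAG_MASK  0x0000003fu
#define PDE_ADDR_MASK  0xfffff000u /* 20-bit address field */

/* Build a present page directory entry that maps a large page. */
uint32_t make_large_pde(uint32_t phys_addr, uint32_t flags)
{
    return (phys_addr & PDE_ADDR_MASK) | (flags & PDE_FLAG_MASK)
        | PDE_LARGE | PDE_PRESENT;
}

/* Nonzero if the entry is present and maps a large page. */
int pde_is_large(uint32_t pde)
{
    return (pde & (PDE_PRESENT | PDE_LARGE)) == (PDE_PRESENT | PDE_LARGE);
}
```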

Normally, the software TLB miss handler checks the present flag on the page directory entry, then dereferences the pointer to read the page table. It then copies the page table entry directly into the TLB. The new TLB handler would check the L flag on the page directory entry. If it is set, it will put the page directory entry directly into the TLB and skip the traversal.
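
The new walk could look roughly like the sketch below. It is host-runnable C rather than the actual handler: `sim_mem` and `read_phys` are hypothetical stand-ins for physical memory, the flag bit positions are assumed from the diagram, and returning 0 to signal a fault is a convention chosen for this example.

```c
#include <stdint.h>

#define ENTRY_PRESENT (1u << 0)   /* P, assumed bit position */
#define ENTRY_LARGE   (1u << 5)   /* L, assumed bit position */
#define ADDR_MASK     0xfffff000u
#define NO_MAPPING    0u          /* P clear: caller would raise a page fault */

/* Hypothetical stand-in for physical memory (16k) so the walk can run on a
   host. A real handler would load from physical addresses directly. */
#define SIM_MEM_WORDS 4096
uint32_t sim_mem[SIM_MEM_WORDS];

uint32_t read_phys(uint32_t pa)
{
    return sim_mem[(pa / 4) % SIM_MEM_WORDS];
}

/* Returns the entry to pass to 'tlbinsert', or NO_MAPPING on a fault. */
uint32_t walk_page_tables(uint32_t page_dir_pa, uint32_t va)
{
    uint32_t pde = read_phys(page_dir_pa + (va >> 22) * 4);
    uint32_t pte;

    if (!(pde & ENTRY_PRESENT))
        return NO_MAPPING;

    /* Large page: the directory entry is the translation, so skip the
       second level entirely. */
    if (pde & ENTRY_LARGE)
        return pde;

    pte = read_phys((pde & ADDR_MASK) + ((va >> 12) & 0x3ff) * 4);
    if (!(pte & ENTRY_PRESENT))
        return NO_MAPPING;

    return pte;
}
```

The large page path saves one memory read per miss, in addition to the larger saving from taking fewer misses overall.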

In hardware, the TLB is currently implemented as a set associative cache, which performs a lookup based on the virtual page number. The new implementation will add a small, fully associative cache that is looked up in parallel for large pages (this is all contained in tlb.v).

               Large Page Lookup
           +----+ +----+ +----+ +----+
     +---> |    | |    | |    | |    | -------+
     |     +----+ +----+ +----+ +----+        |
     |                                        |      
     |          Small Page Lookup             +-----> |\
     |      way0   way1   way2   way3                 | |_____\
     |     +----+ +----+ +----+ +----+                | |     /
va --+     |    | |    | |    | |    |        +-----> |/
     |     +----+ +----+ +----+ +----+        |
     |     +----+ +----+ +----+ +----+        |
     +-->  |    | |    | |    | |    |  ------+
           +----+ +----+ +----+ +----+ 
           +----+ +----+ +----+ +----+ 
           |    | |    | |    | |    |
           +----+ +----+ +----+ +----+ 
                      ...

The large page cache will contain a small number of entries (around 4-8). This keeps the design simple and avoids the need for a set associative structure. This should be sufficient, since each large page covers a large area of memory and it is unlikely that many will be in use simultaneously. When a 'tlbinsert' instruction is executed, hardware will choose which cache to insert the entry into based on the 'L' flag: if it is set, the entry goes into the large page cache, otherwise into the small page cache. The TLB invalidate instruction will apply to both the large and small page caches.
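
A behavioral model of the large page cache might look like the following. This is a C sketch of the intended behavior, not the Verilog in tlb.v. The structure and function names are hypothetical, the round-robin replacement policy is an assumption (the text does not specify one), and the small page cache is not modeled.

```c
#include <stdint.h>

#define ENTRY_LARGE        (1u << 5)  /* 'L' flag, bit position assumed */
#define LARGE_TLB_ENTRIES  4          /* the text suggests around 4-8 */
#define LARGE_PAGE_SHIFT   22         /* 4MB pages */

struct large_tlb_entry {
    int valid;
    uint32_t vpn;    /* virtual address bits 31:22 */
    uint32_t entry;  /* physical address and flags, as given to tlbinsert */
};

struct large_tlb {
    struct large_tlb_entry entries[LARGE_TLB_ENTRIES];
    unsigned next_victim;  /* round-robin replacement (an assumption) */
};

/* Returns 1 if the entry was placed in the large page cache, or 0 if the
   'L' flag is clear and it belongs in the small page cache instead. */
int tlb_insert(struct large_tlb *tlb, uint32_t va, uint32_t entry)
{
    uint32_t vpn = va >> LARGE_PAGE_SHIFT;
    unsigned slot = tlb->next_victim;
    unsigned i;

    if (!(entry & ENTRY_LARGE))
        return 0;

    /* Reuse a slot that already maps this region so a lookup never hits
       two entries at once. */
    for (i = 0; i < LARGE_TLB_ENTRIES; i++) {
        if (tlb->entries[i].valid && tlb->entries[i].vpn == vpn)
            slot = i;
    }
    if (slot == tlb->next_victim)
        tlb->next_victim = (tlb->next_victim + 1) % LARGE_TLB_ENTRIES;

    tlb->entries[slot].valid = 1;
    tlb->entries[slot].vpn = vpn;
    tlb->entries[slot].entry = entry;
    return 1;
}

/* Fully associative lookup. Hardware would compare every entry in
   parallel; the loop stands in for those comparators. */
int large_tlb_lookup(const struct large_tlb *tlb, uint32_t va,
                     uint32_t *entry_out)
{
    uint32_t vpn = va >> LARGE_PAGE_SHIFT;
    unsigned i;

    for (i = 0; i < LARGE_TLB_ENTRIES; i++) {
        if (tlb->entries[i].valid && tlb->entries[i].vpn == vpn) {
            *entry_out = tlb->entries[i].entry;
            return 1;
        }
    }
    return 0;
}

/* Invalidate any large page entry that covers 'va'. */
void large_tlb_invalidate(struct large_tlb *tlb, uint32_t va)
{
    uint32_t vpn = va >> LARGE_PAGE_SHIFT;
    unsigned i;

    for (i = 0; i < LARGE_TLB_ENTRIES; i++) {
        if (tlb->entries[i].vpn == vpn)
            tlb->entries[i].valid = 0;
    }
}
```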

It is possible in this configuration to have translations for a specific address in both the small and large caches. In this case, the translation from the large cache will be used. It is the responsibility of software to invalidate all small page translations in the range before inserting a large page.

From the view of hardware modules outside the 'tlb' module, a large page should look the same as if software had created 1024 small page entries and put them into the TLB.

Although the virtual address will necessarily be 4MB aligned, it is not clear that the physical address needs to be. Consider the case where two large pages have overlapping physical ranges.
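
The alignment question affects how the physical address is formed. The sketch below shows both options for comparison without claiming which one the design should adopt; the function names are hypothetical. If the physical base is 4MB aligned, the low 22 bits of the virtual address can simply be concatenated with the upper bits of the entry. If the base is only 4k aligned, the offset has to be added, which costs an adder in the lookup path.

```c
#include <stdint.h>

#define LARGE_OFFSET_MASK 0x003fffffu  /* byte offset within a 4MB page */
#define ADDR_MASK         0xfffff000u  /* 20-bit address field of the entry */

/* Option A: physical base is 4MB aligned, so the address is formed by
   concatenating entry bits 31:22 with the 22-bit offset. */
uint32_t translate_aligned(uint32_t entry, uint32_t va)
{
    return (entry & ~LARGE_OFFSET_MASK) | (va & LARGE_OFFSET_MASK);
}

/* Option B: physical base is only 4k aligned, so the 22-bit offset must
   be added to the base. */
uint32_t translate_unaligned(uint32_t entry, uint32_t va)
{
    return (entry & ADDR_MASK) + (va & LARGE_OFFSET_MASK);
}
```

For a 4MB aligned base the two functions give the same result, and either matches what 1024 consecutive small page entries would produce.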

Testing

  • Functional

    • Ensure large and small page collisions are handled properly, that is, when someone [incorrectly] inserts a TLB entry for a small page that is inside the address range of a large page. In this case, the large page should probably take precedence (although we could argue this is a bug in system software and the behavior is undefined).
    • Try several virtual addresses within a large page
    • Ensure inserting a large page entry doesn't alter small page lookup and vice versa
    • All page flag tests probably need to be split into small and large page versions to ensure flags are read properly (it might be possible to test both cases in the same file to avoid too much duplication).
  • Performance

    • Any test that currently uses a framebuffer seems like a useful candidate. For example, doom, rotozoom, or quakeview.