This repository contains all the work done for the project TCG Plugin: Cache modelling, in which a multi-core, multi-level cache modelling TCG plugin was developed. I also wrote a QEMU blog post that covers the internals of the plugin in more technical detail, along with an example demonstrating how to use it. The plugin can optionally be attached to QEMU in either user-mode emulation or full-system emulation. When execution finishes, the plugin outputs cache performance statistics for the memory access pattern exhibited by the emulated target.
QEMU, as a multi-architecture emulator, uses TCG (the Tiny Code Generator) as one means of translating the target architecture's instructions into host instructions that run on the processor executing QEMU. TCG allows subscribers to be registered for several events; this mechanism makes up the TCG plugins subsystem. Through plugins, one can subscribe to events such as instruction translation, instruction execution, and memory access, and instrument them through registered callbacks.
TCG plugins can observe the system down to the granularity of individual instruction execution and memory access. By utilizing the QEMU plugin API, we intercept those events and emulate CPU caches whose parameters are configurable before run time.
While different microarchitectures often take different approaches at the very low level, the core concepts of caching are universal. As QEMU is not a microarchitectural emulator, we model an ideal caching system with a few simple parameters. By doing so, we can adequately emulate the behaviour of a caching system without diverging much from real-hardware behaviour.
The scope of the plugin is to catch the trends that depend on the memory access pattern, which is largely dependent on the executable code, without delving into microarchitectural details that change from one microarchitecture to another.
That said, we limit ourselves to private per-core L1 instruction caches and private per-core L1 data caches, with the option to emulate unified (data + instructions) per-core L2 caches.
To keep it simple, no inter-core interaction was taken into account, since such considerations are largely implementation-dependent and will vary from one microarchitecture to another.
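The hierarchy described above can be sketched as follows. This is a minimal illustration, not the plugin's code: `TinyCache`, `CoreCaches`, and the sizes used are hypothetical, each level is simplified to a fully associative LRU cache, and the sketch assumes the unified L2 is consulted only when the L1 lookup misses.

```python
from collections import OrderedDict


class TinyCache:
    """Fully associative LRU cache of block numbers; stands in for any level."""

    def __init__(self, num_blocks, block_size=64):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.blocks = OrderedDict()  # block number -> None, least recent first
        self.accesses = 0
        self.misses = 0

    def access(self, addr):
        """Look up addr; return True on a hit, False on a miss."""
        block = addr // self.block_size
        self.accesses += 1
        if block in self.blocks:
            self.blocks.move_to_end(block)
            return True
        self.misses += 1
        if len(self.blocks) >= self.num_blocks:
            self.blocks.popitem(last=False)  # evict the least recently used
        self.blocks[block] = None
        return False


class CoreCaches:
    """Private L1 I-cache and D-cache for one core, plus an optional unified L2."""

    def __init__(self, use_l2=False):
        self.l1i = TinyCache(num_blocks=4)
        self.l1d = TinyCache(num_blocks=4)
        self.l2 = TinyCache(num_blocks=16) if use_l2 else None

    def fetch(self, addr):
        """Instruction fetch: goes through the L1 instruction cache."""
        self._access(self.l1i, addr)

    def load_store(self, addr):
        """Data access: goes through the L1 data cache."""
        self._access(self.l1d, addr)

    def _access(self, l1, addr):
        # Assumption: the L2 only sees accesses that missed in L1.
        if not l1.access(addr) and self.l2 is not None:
            self.l2.access(addr)
```

Because the L2 is unified, a block brought in by an instruction fetch can later satisfy a data access that misses in the L1 data cache, and vice versa.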
Emulated caches are configurable in terms of the following parameters:
- Overall cache size
- Block (line) size
- Set-associativity
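These three parameters fix the cache geometry. The sketch below shows the usual textbook relationship between them, assuming sizes are given in bytes and associativity is the number of blocks per set; `CacheGeometry` and its method names are hypothetical and are not taken from the plugin source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheGeometry:
    cache_size: int  # total capacity in bytes
    block_size: int  # bytes per block (line)
    assoc: int       # blocks per set

    def __post_init__(self):
        if min(self.cache_size, self.block_size, self.assoc) <= 0:
            raise ValueError("all parameters must be positive")
        if self.cache_size % (self.block_size * self.assoc) != 0:
            raise ValueError("cache size must be a multiple of block_size * assoc")

    @property
    def num_sets(self):
        """Number of sets implied by the three parameters."""
        return self.cache_size // (self.block_size * self.assoc)

    def split(self, addr):
        """Return (tag, set_index) for a byte address."""
        block_number = addr // self.block_size
        return block_number // self.num_sets, block_number % self.num_sets
```

For example, a 16 KiB cache with 64-byte blocks and 8-way associativity has 16384 / (64 * 8) = 32 sets.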
A single cache eviction policy can also be specified as a plugin argument. This policy can be one of the following:
- Least Recently Used (LRU)
- First In, First Out (FIFO)
- Random eviction
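The difference between the three policies shows up only when a full set has to give up a block. The following is a minimal sketch of one cache set under each policy; the class name `CacheSet` and the policy strings used here are illustrative assumptions, and the bookkeeping is not meant to match the plugin's data structures.

```python
import random
from collections import OrderedDict


class CacheSet:
    """One cache set holding up to `assoc` tags under a chosen eviction policy."""

    def __init__(self, assoc, policy="lru", rng=None):
        if policy not in ("lru", "fifo", "rand"):
            raise ValueError(f"unknown policy: {policy}")
        self.assoc = assoc
        self.policy = policy
        self.rng = rng or random.Random(0)  # seeded for reproducible runs
        self.blocks = OrderedDict()  # tag -> None, oldest entry first

    def access(self, tag):
        """Return True on a hit; on a miss, insert tag (evicting if full)."""
        if tag in self.blocks:
            if self.policy == "lru":
                # Only LRU refreshes a block's position on a hit.
                self.blocks.move_to_end(tag)
            return True
        if len(self.blocks) >= self.assoc:
            if self.policy == "rand":
                victim = self.rng.choice(list(self.blocks))
                del self.blocks[victim]
            else:
                # LRU: front is the least recently used block.
                # FIFO: front is the block inserted earliest.
                self.blocks.popitem(last=False)
        self.blocks[tag] = None
        return False
```

With a 2-way set and the tag sequence 1, 2, 1, 3, 1, LRU evicts tag 2 when tag 3 arrives, so the final access to tag 1 hits; FIFO evicts tag 1 instead, so the same access misses.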
For multi-threaded user-space programs and full-system emulation with access to more than one core, we can specify the number of "cores" to emulate caches for.
QEMU can emulate multi-threaded user-space applications, and it can provide more than one CPU for a guest kernel in full-system emulation. These kinds of workloads are supported through the following mechanisms.
TCG plugins have access to basic information about the system, such as the number of cores available to the guest. By default, this information is used to construct a cache emulation system for each available core (i.e. an L1 instruction cache, an L1 data cache, and optionally a unified L2 cache).
A subscription callback for a memory access event has access to the vCPU index that initiated the access. This is used to identify the cache to access.
In user-space emulation, QEMU mirrors the thread structure of the emulated program, and is bounded only by how many threads the host kernel will allow it to create. This means that we cannot know how many threads will be created prior to running.
To mitigate this, the plugin tracks a static number of cores (1 by default) that can be configured as a plugin argument. If the number of threads is more than the number of available cores, the threads may thrash each other.
This mirrors how kernels allow user-space applications to create as many threads as they want, but eventually those threads must be scheduled on an available physical core, where they may thrash each other.
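The thrashing effect can be illustrated with a small model. The sketch below assumes that a thread's vCPU index is folded onto the configured number of cores with a modulo; that mapping, along with the names `OneSetCache`, `MultiCoreModel`, and `count_misses`, is an assumption made for illustration and should be checked against the plugin source.

```python
from collections import OrderedDict


class OneSetCache:
    """Single-set LRU cache holding `ways` blocks; a stand-in for a per-core L1."""

    def __init__(self, ways=2, block_size=64):
        self.ways = ways
        self.block_size = block_size
        self.blocks = OrderedDict()

    def access(self, addr):
        """Return True on a hit, False on a miss."""
        block = addr // self.block_size
        if block in self.blocks:
            self.blocks.move_to_end(block)
            return True
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)
        self.blocks[block] = None
        return False


class MultiCoreModel:
    """A fixed number of modelled cores, each with its own cache."""

    def __init__(self, cores=1):
        self.caches = [OneSetCache() for _ in range(cores)]

    def on_mem_access(self, vcpu_index, addr):
        # Assumption: vCPU indices beyond the core count wrap around.
        cache = self.caches[vcpu_index % len(self.caches)]
        return cache.access(addr)


def count_misses(cores, iterations=10):
    """Two guest threads, each repeatedly touching its own two blocks."""
    model = MultiCoreModel(cores)
    misses = 0
    for _ in range(iterations):
        for vcpu, base in ((0, 0x0000), (1, 0x8000)):
            for offset in (0, 64):
                if not model.on_mem_access(vcpu, base + offset):
                    misses += 1
    return misses
```

With two modelled cores, each thread's two blocks fit in its own cache and only the first pass misses. With one modelled core, both threads share a 2-block cache and keep evicting each other's blocks, so every access misses.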
All the goals defined by the project proposal were successfully met and merged (either upstreamed or in the maintainer's tree, since the QEMU project was in a release cycle when GSoC 2021 ended). However, a few convenience features for input and output could still be added.
The plugin has various parameters, and usually the great majority of them are used. This makes the invocation command cluttered. Hence, the plugin could benefit from reading its parameters from a configuration file.
Also, it would be nice to support outputting the data in a standard format like YAML or JSON, so that plugin output could be fed into another program for post-processing.
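As a rough idea of what machine-readable output could look like, the sketch below serializes per-core counters to JSON. The field names, the `stats_to_json` helper, and the input layout are all hypothetical; the plugin does not currently emit this format.

```python
import json


def stats_to_json(per_core):
    """Serialize per-core cache counters to a JSON string.

    `per_core` is a list with one dict per core, mapping a cache level
    name ("l1d", "l1i", "l2") to an (accesses, misses) pair. Levels that
    are absent, such as a disabled L2, are left out of the report.
    """
    report = {"cores": []}
    for index, stats in enumerate(per_core):
        entry = {"core": index}
        for level in ("l1d", "l1i", "l2"):
            if level not in stats:
                continue
            accesses, misses = stats[level]
            entry[level] = {
                "accesses": accesses,
                "misses": misses,
                "miss_rate": misses / accesses if accesses else 0.0,
            }
        report["cores"].append(entry)
    return json.dumps(report, indent=2)
```

A post-processing script could then load the report with any JSON parser instead of scraping the plugin's textual summary.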
I'd like to express my sincere gratitude to Alex Bennée (stsquad on IRC) for mentoring me, patiently reviewing my patches, and answering my questions.
I'd also like to thank the QEMU community for helping me on various occasions during my GSoC participation.
Only accepted or ongoing changes are listed in this section.
[PATCH v4 0/5] plugins: New TCG plugin for cache modelling
- [PATCH v4 1/5] plugins: Added a new cache modelling plugin
- [PATCH v4 2/5] plugins/cache: Enable cache parameterization
- [PATCH v4 3/5] plugins/cache: Added FIFO and LRU eviction policies
- [PATCH v5] docs/devel: Added cache plugin to the plugins docs
- [PATCH v5] MAINTAINERS: Added myself as a reviewer for TCG Plugins
Misc. Updates and bug fixes
- [PATCH 1/6] plugins/cache: Fixed a bug with destroying FIFO metadata
- [PATCH 2/6] plugins/cache: limited the scope of a mutex lock
- [PATCH 6/6] plugins/cache: Fixed "function decl. is not a prototype" warnings
[PATCH v5 0/2] plugins/cache: multicore cache modelling
- [PATCH v5 1/2] plugins/cache: supported multicore cache modelling
- [PATCH v5 2/2] docs/devel/tcg-plugins: added cores arg to cache plugin
[PATCH v4 00/13] new plugin argument passing scheme
[PATCH 0/5] plugins/cache: L2 cache modelling and a minor leak fix
- [PATCH 1/5] plugins/cache: freed heap-allocated mutexes
- [PATCH 2/5] plugins/cache: implement unified L2 cache emulation
- [PATCH 3/5] plugins/cache: split command line arguments into name and value
- [PATCH 4/5] plugins/cache: make L2 emulation optional through args
- [PATCH 5/5] docs/tcg-plugins: add L2 arguments to cache docs
[PATCH v4 00/13] new plugin argument passing scheme
- [PATCH v4 01/13] plugins: allow plugin arguments to be passed directly
- [PATCH v4 02/13] plugins/api: added a boolean parsing plugin api
- [PATCH v4 13/13] docs/deprecated: deprecate passing plugin args through arg=
[PATCH v4] plugins/syscall: Added a table-like summary output
[PATCH] plugins/execlog: removed unintended "s" at the end of log lines.
[PATCH v4 00/13] new plugin argument passing scheme
- [PATCH v4 03/13] plugins/hotpages: introduce sortby arg and parsed bool args correctly
- [PATCH v4 04/13] plugins/hotblocks: Added correct boolean argument parsing
- [PATCH v4 05/13] plugins/lockstep: make socket path not positional & parse bool arg
- [PATCH v4 06/13] plugins/hwprofile: adapt to the new plugin arguments scheme
- [PATCH v4 07/13] plugins/howvec: adapting to the new argument passing scheme
- [PATCH v4 09/13] tests/plugins/bb: adapt to the new arg passing scheme
- [PATCH v4 10/13] tests/plugins/insn: made arg inline not positional and parse it as bool
- [PATCH v4 11/13] tests/plugins/mem: introduce "track" arg and make args not positional
- [PATCH v4 12/13] tests/plugins/syscalls: adhere to new arg-passing scheme
[PATCH v3] blog: add a post for the new TCG cache modelling plugin
At the time of this post, single-core cache emulation is merged into QEMU upstream, so a fresh clone of the QEMU source already includes that code. Multi-core emulation has been accepted by the maintainer, but since the QEMU project is in a release cycle and is not taking major updates, that code will have to wait until the project stabilizes before it is merged. To use it in the meantime, you may manually apply the relevant patches or fetch the plugins/next tree. The plugins/next tree also contains the new plugin argument passing scheme. The L2 part is waiting to be reviewed and is still experimental; it must be applied manually using the corresponding patches listed above.