- This file contains notes related to The Art of Multiprocessor Programming, Appendix B organized by section.
- Appendix B covers prerequisite knowledge related to systems such as processors, threads, memory and communication abstractions as well as concurrency concerns.
- Programming a multiprocessor requires knowledge about the computer's architecture.
- Processors are hardware that execute software threads.
- Generally, a processor will be assigned multiple threads but executes only a single thread at any given moment.
- An interconnect is a communication link shared between processors as well as memory.
- Memory's primary purpose is to store data, and it is typically structured as a hierarchy of components.
- An aspect of a processor's performance is contingent on which level of the hierarchy the memory component it accesses lives on.
- The time it takes a processor to access memory is known as latency.
- An atomic operation is an operation that can occur uninterrupted by other threads and processes.
- This is important because it lets us anticipate the state of a shared resource in a predictable manner, which is a central concern of parallel algorithms.
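- As a minimal sketch of the difference (assuming Java and its `java.util.concurrent.atomic` package, which the notes do not mention), a plain `count++` is a read-modify-write that threads can interleave, while `AtomicInteger.getAndIncrement` performs the same update atomically:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicCounter {
    private int plainCount = 0;                            // updates can be lost under races
    private final AtomicInteger atomicCount = new AtomicInteger(0);

    public void unsafeIncrement() {
        plainCount++;                  // read, add, write: other threads can interleave here
    }

    public void safeIncrement() {
        atomicCount.getAndIncrement(); // single uninterruptible read-modify-write
    }
}
```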
- The test-and-set (TAS) lock is a spin lock built on the atomic test-and-set operation, as illustrated in the diagram and sketch below.
```mermaid
graph LR;
subgraph Thread B
direction LR
A1("...") -- approaching critical section --> B1("Call testAndSet(lockValue)<br/>Make copy of current lock value<br/>Set lockValue to 1<br/>return copy")
B1 --> C1{"Is copy == 0?"}
C1 -- no --> B1
C1 -- yes --> D1("Perform action in critical section")
D1 --> E1("Set lockValue to 0")
E1 --> F1("...")
end
subgraph Thread A
direction LR
A2("...") -- approaching critical section --> B2("Call testAndSet(lockValue)<br/>Make copy of current lock value<br/>Set lockValue to 1<br/>return copy")
B2 --> C2{"Is copy == 0?"}
C2 -- no --> B2
C2 -- yes --> D2("Perform action in critical section")
D2 --> E2("Set lockValue to 0")
E2 --> F2("...")
end
```
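- The diagram corresponds to a lock along the following lines; this is a minimal sketch assuming Java's `AtomicBoolean.getAndSet` as the atomic test-and-set operation (class and method names here are illustrative):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Spin lock built on test-and-set: getAndSet(true) writes true and returns
// the previous value; the thread that observes false has acquired the lock.
public class TASLock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    public void lock() {
        while (state.getAndSet(true)) {
            // spin: the lock was already held, try again
        }
    }

    public void unlock() {
        state.set(false); // release: the next getAndSet will observe false
    }
}
```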
- A multiprocessor is a system that uses multiple processors.
- Each processor executes a sequential program.
- A thread is a software construct that represents a sequential program.
- A context switch occurs when a processor changes from one thread to another; switching a thread out is known as descheduling.
- Context switches negatively impact performance and are generally considered costly.
- A scheduler[^1] is an abstract mechanism used to determine the order in which the processor will execute threads.
- A dispatcher[^1] is an abstract mechanism responsible for performing the context switch.
- A cycle is the metric used to measure the amount of time it takes a processor to fetch and execute a single instruction.
- An architecture is necessary to implement the communication link between processors and memory.
- Non-uniform memory access (NUMA) architecture utilizes a network-styled design of nodes where each node consists of a set of processors and its own local memory. Each node can access any other node's local memory, collectively creating a global address space shared by all processors.
- A processor can access its own node's memory faster than it can access other nodes' memory.
- NUMA scales well due to its network design; however, the trade-off is that it is more complex to manage.
- Symmetric multiprocessing (SMP) is the most common architecture; processors and memory each have a bus control unit and are linked together by a bus.
- The bus control unit listens / sends messages on the bus.
- SMP systems typically do not scale well due to the constraints of the bus.
- The interconnect is a finite resource and should be managed effectively.
- Essentially, memory is a large array of words indexed by addresses.
- A word is a fundamental unit of data whose size is a multiple of a byte and is typically determined by the processor.
- A 64-bit processor would likely store a 64-bit word for every 64-bit address.
- Essentially, a processor sends a read or write request containing a memory address to a memory component; the memory component then uses that address to fulfill the operation.
- A processor / thread sees logical addresses, the system uses physical addresses, and virtual addressing binds them together.[^2]
- A processor's request to memory could take hundreds of cycles which can constrain performance.
- A cache is a smaller, faster storage component within the memory hierarchy that is physically positioned closer to the processor.
- A modern processor will typically have an L1 and L2 cache.[^3]
- Some caches, typically L1 and sometimes L2, are private caches, meaning that only a single processor has access to them.[^4]
- A cache can fulfill a processor's memory request substantially faster than main memory.
- Caches use locality to store frequently used data, circumventing the need to send requests to main memory.
- If a cache contains the address related to the processor's request, then it is a hit and the cache fulfills the request.[^3]
- A cache has substantially less space because it is more costly to manufacture; this requires management of its space.
- The primary method used to manage space is logic known as a replacement policy, which determines what data is kept or evicted when the cache is full.
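- As an illustrative sketch of a replacement policy (a software LRU cache in Java, not how a hardware cache is implemented), `LinkedHashMap` in access order evicts the least recently used entry once capacity is exceeded:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU (least recently used) replacement policy: once the cache is full,
// the entry that was touched longest ago is evicted to make room.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(capacity, 0.75f, true); // access-order mode tracks recency of use
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;     // evict the oldest entry when full
    }
}
```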
- Cache coherence is a condition that asserts all private cache entries sharing common addresses should be consistent with the data in main memory; entries that violate this are invalidated.
- A common protocol used to enforce cache coherence is called MESI which stands for "Modified, Exclusive, Shared, Invalid" and is related to possible cache line states.
- Modified means the data for the address stored in the cache has been modified and needs to be written back to main memory.
- Exclusive means the data for the address stored in the cache has not been modified, and no other processor has cached this address.
- Shared means the data for the address stored in the cache has not been modified, and other processors may have the same address cached.
- Invalid means the data for the address stored in the cache violates coherence and is no longer meaningful.
- False sharing is a pattern under MESI where two processors' caches hold the same cache line, with one processor frequently reading it and the other frequently writing it. The writes repeatedly invalidate the line in the reader's cache, hindering performance.
- This is more likely to happen with larger cache lines, which is a potential trade-off of using them.
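- A hypothetical Java layout that is prone to false sharing, along with a padded variant (assuming 64-byte cache lines; the JVM does not guarantee field layout, so this is only illustrative):

```java
// Two counters updated by different threads will likely land on the same
// cache line, so each write invalidates the other thread's cached copy.
class SharedCounters {
    volatile long counterA; // written only by thread A
    volatile long counterB; // written only by thread B
}

// One mitigation is padding the hot fields apart so they are likely to fall
// on separate cache lines; the padding fields exist only to take up space.
class PaddedCounters {
    volatile long counterA;
    long p1, p2, p3, p4, p5, p6, p7;
    volatile long counterB;
}
```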
- A processor is spinning if it is repeatedly checking whether some data in memory satisfies some condition, which is a waste of resources.
- Essentially, the TAS-lock will constantly hammer the bus with memory requests for the lock's Boolean field. This causes a lot of traffic that impedes other threads and reduces performance.
- In contrast, the TTAS-lock performs better than the TAS-lock because its initial read operation is served from the L1 cache, reducing traffic on the bus; a sketch follows.
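- A minimal sketch of the TTAS-lock under the same assumptions as the TAS sketch above (Java's `AtomicBoolean`): the inner loop spins on a locally cached read, and the bus-traffic-generating `getAndSet` is only issued once the lock looks free.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Test-and-test-and-set (TTAS) spin lock.
public class TTASLock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    public void lock() {
        while (true) {
            while (state.get()) {
                // spin on the cached value; no bus traffic while the lock is held
            }
            if (!state.getAndSet(true)) {
                return;                // acquired the lock
            }
        }
    }

    public void unlock() {
        state.set(false);
    }
}
```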
- Ultimately, both are spin-lock algorithms, which yield poor performance.
- In a multi-core architecture, a chip consists of multiple processors, each of which contains its own L1 cache and shares an L2 cache. This setup can enable efficient communication.
- In a multithreaded architecture, a single processor is capable of executing multiple threads at once by interleaving their execution.
- Modern processors utilize both architectural designs.
- When a processor performs a write operation, the data is saved in the cache and its state is set to dirty.
- This implies that the value will eventually need to be written to main memory.
- A more modern approach is to use a write buffer.
- Write buffers store write operations and apply them to memory together at a later time via batching.
- Write absorption allows the buffer to discard an earlier value when a later write to the same address is made before the buffer is applied to memory.
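- An illustrative, purely software sketch of write buffering with absorption; the class names and the `Map`-backed memory are assumptions for the example, not how hardware implements it:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Pending writes are keyed by address, so a later write to the same address
// replaces the earlier buffered value (absorption); flush() batches the
// surviving values out to memory.
public class WriteBuffer {
    private final Map<Long, Long> pending = new LinkedHashMap<>();
    private final Map<Long, Long> mainMemory;

    public WriteBuffer(Map<Long, Long> mainMemory) {
        this.mainMemory = mainMemory;
    }

    public void write(long address, long value) {
        pending.put(address, value); // absorbs any earlier write to this address
    }

    public void flush() {
        mainMemory.putAll(pending);  // apply the batched writes to memory
        pending.clear();
    }
}
```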
- Write buffers can cause read conflicts; the buffer may not apply its changes to main memory fast enough, so reads can yield stale values.
- It mentions that compilers can reorder the reads and writes a thread performs, which can cause inconsistencies between threads.
- An attempt to resolve this was to provide guarantees about the extent to which reorderings can occur; it recommends you shouldn't rely on these.
- It instead recommends the use of memory barrier instructions, which force the contents of write buffers to be written out to memory.
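- A small Java sketch (an assumed example, not from the book) of why this matters: the `volatile` flag inserts the barriers that make the writer's earlier store visible to the reader; without `volatile`, the reader could spin forever or observe a stale value.

```java
public class VisibilityExample {
    static int data = 0;
    static volatile boolean ready = false; // volatile write/read act as barriers

    public static void main(String[] args) {
        Thread writer = new Thread(() -> {
            data = 42;    // ordinary write, may sit in a write buffer
            ready = true; // volatile write: earlier writes become visible first
        });
        Thread reader = new Thread(() -> {
            while (!ready) {
                // spin on the volatile flag
            }
            System.out.println(data); // guaranteed to print 42
        });
        writer.start();
        reader.start();
    }
}
```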
[^1]: Supplemental content from a systems course; terminology is not well grounded within the field and is often interchanged in different literature.
[^2]: Supplemental content from a systems course; loosely references how a system manages the abstraction of memory.
[^3]: Supplemental content from a systems course; reviews caches and how they fit inside the memory abstractions.
[^4]: Supplemental content; a Wikipedia article describing aspects of the cache hierarchy.