
# The_Art_of_Multiprocessor_Programming_appendix_b_notes.md

Overview

  • This file contains notes on The Art of Multiprocessor Programming, Appendix B, organized by section.
  • Appendix B covers prerequisite systems knowledge: processors, threads, memory and communication abstractions, and related concurrency concerns.

Introduction (and a Puzzle)

  • Programming a multiprocessor requires knowledge about the computer's architecture.
  • Processors are hardware that execute software threads.
  • Generally, a processor is assigned multiple threads but executes only one of them at any given moment.
  • An interconnect is a communication link shared between processors as well as memory.
  • Memory's primary purpose is to store data and is typically structured as a hierarchy of components.
  • An aspect of a processor's performance is contingent on which level of the memory hierarchy the requested data lives in.

    *(image: memory hierarchy diagram)*
  • The time it takes a processor to access memory is known as latency.
  • An atomic operation is an operation that can occur uninterrupted by other threads and processes.
    • This is important because it lets us anticipate the state of a shared resource in a predictable manner, which is a central concern of parallel algorithms.
  • A test-and-set (TAS) lock is a spin lock built on this atomic operation; the diagram and sketch below show two threads contending for one.
```mermaid
graph LR;

    subgraph Thread B
        direction LR
        A1("...") -- approaching critical section --> B1("Call testAndSet(lockValue)<br/>Make copy of current lock value<br/>Set lockValue to 1<br/>return copy")
        B1 --> C1{"Is copy == 0?"}
        C1 -- no --> B1
        C1 -- yes --> D1("Perform action in critical section")
        D1 --> E1("Set lockValue to 0")
        E1 --> F1("...")
    end

    subgraph Thread A
        direction LR
        A2("...") -- approaching critical section --> B2("Call testAndSet(lockValue)<br/>Make copy of current lock value<br/>Set lockValue to 1<br/>return copy")
        B2 --> C2{"Is copy == 0?"}
        C2 -- no --> B2
        C2 -- yes --> D2("Perform action in critical section")
        D2 --> E2("Set lockValue to 0")
        E2 --> F2("...")
    end
```
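
A minimal Java sketch of the lock the diagram describes, mirroring the book's TASLock; `AtomicBoolean.getAndSet(true)` is the atomic read-modify-write that the flowchart's `testAndSet(lockValue)` steps model:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Test-and-set (TAS) spin lock: the lock word is a single Boolean flag.
public class TASLock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    public void lock() {
        // getAndSet(true) atomically stores true and returns the prior value.
        // If the prior value was false, the lock was free and we now hold it;
        // otherwise keep spinning.
        while (state.getAndSet(true)) { }
    }

    public void unlock() {
        state.set(false); // release: set lockValue back to 0/false
    }
}
```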

Processors and Threads

  • A multiprocessor is a system that uses multiple processors.
  • Each processor executes a sequential program.
  • A thread is a software construct that represents a sequential program.
  • A context switch is when a processor changes from one thread to another; removing the currently running thread from the processor is known as descheduling.
    • Context switches negatively impact performance and are generally considered costly.
  • A scheduler1 is an abstract mechanism used to determine the order in which the processor will execute threads.
  • A dispatcher1 is an abstract mechanism responsible for performing the context switch.
  • A cycle is the metric used to measure the amount of time it takes a processor to fetch and execute a single instruction.
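
As a small illustration of scheduling (the class name is hypothetical), the program below starts twice as many threads as there are processors and leaves it to the scheduler and dispatcher to multiplex them onto the hardware:

```java
// Starts more threads than processors; the OS scheduler decides which thread
// runs on which processor and context-switches between them.
public class SchedulerDemo {
    public static void main(String[] args) throws InterruptedException {
        int processors = Runtime.getRuntime().availableProcessors();
        Thread[] threads = new Thread[2 * processors];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() ->
                System.out.println(Thread.currentThread().getName() + " ran"));
            threads[i].start();
        }
        for (Thread t : threads) t.join(); // wait for every thread to finish
    }
}
```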

Interconnect

  • An architecture is necessary to implement the communication link between processors and memory.
    • Non-uniform memory access (NUMA) architecture utilizes a network-styled design of nodes, where each node consists of a set of processors and its own local memory. Each node can access any other node's local memory, collectively creating a global address space shared by all processors.
      • A processor can access its node's memory faster than other nodes.
      • NUMA scales well due to its network design; the trade-off is that it is more complex to manage.
    • Symmetric multiprocessing (SMP) is the most common architecture; each processor and memory module has a bus control unit, and all are linked together by a shared bus.
      • The bus control unit listens for and sends messages on the bus.
      • SMP systems typically do not scale well due to the constraints of the bus.
  • The interconnect is a finite resource and should be managed effectively.

Memory

  • Essentially, memory is a large array of words indexed by addresses.
    • A word is a fundamental unit of data, a multiple of a byte in size, typically determined by the processor.
    • A 64-bit processor would likely store a 64-bit word for every 64-bit address.
  • Essentially, a processor sends a read or write request containing a memory address to a memory component, and the memory component uses that address to fulfill the operation.
    • A processor/thread sees logical addresses, the system uses physical addresses, and the virtual memory system binds the two together.2
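
A toy model of word-indexed memory (purely illustrative, not a real API): on a 64-bit machine with 8-byte words, a byte address maps to word index `address / 8`:

```java
// Toy model: memory as an array of 64-bit words indexed by address.
public class WordMemory {
    private final long[] words = new long[1 << 20]; // 2^20 words = 8 MiB

    // Convert a byte address to a word index (8 bytes per word).
    public long read(long byteAddress) {
        return words[(int) (byteAddress / 8)];
    }

    public void write(long byteAddress, long value) {
        words[(int) (byteAddress / 8)] = value;
    }
}
```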

Cache

  • A processor's request to memory could take hundreds of cycles which can constrain performance.
  • A cache is a secondary storage device within the memory hierarchy that is physically positioned closer to the processor.
  • A modern processor will typically have an L1 and an L2 cache.3
  • Some caches, typically L1 and sometimes L2, are private caches, meaning that only a single processor has access to them.4
  • A cache can fulfill a processor's memory request substantially faster than main memory.
  • A cache exploits locality, storing frequently used data to circumvent the need to send requests to main memory.
    • If a cache contains the address related to the processor's request then it is a hit and will fulfill the request.3
  • A cache has substantially less space than main memory because it is more costly to manufacture, so its space must be managed.
    • The primary method of managing space is logic known as a replacement policy, which determines what data is kept or evicted when the cache is full; a sketch of one common policy follows this list.
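
A sketch of one well-known replacement policy, least recently used (LRU), built on Java's real `LinkedHashMap` in access-order mode (the class name is hypothetical; hardware caches use cheaper approximations of LRU):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache: when full, evict the entry that was used least recently.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true: iteration order tracks recency
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // replacement policy: drop the LRU entry when full
    }
}
```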

Coherence

  • Cache coherence is a condition asserting that all private cache entries sharing a common address must be consistent with one another and with main memory; entries that fall out of sync are invalidated.
  • A common protocol used to enforce cache coherence is MESI, which stands for "Modified, Exclusive, Shared, Invalid," the possible states of a cache line.
    • Modified means the cached data for the address has been modified and must eventually be written back to main memory.
    • Exclusive means the cached data has not been modified and no other processor has cached this address.
    • Shared means the cached data has not been modified, but other processors may have the same address cached.
    • Invalid means the cached data violates coherence and is no longer meaningful.
  • False sharing is a pattern under MESI where two processors use logically distinct data that happen to live on the same cache line; one processor's frequent writes repeatedly invalidate the line in the other's cache, hindering performance. See the sketch below.
    • This is more likely to happen with larger cache lines, which is a potential trade-off of larger line sizes.
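
A hedged micro-benchmark sketch of false sharing (class name and iteration count are illustrative; exact timings depend on the hardware): two threads increment independent counters, first in adjacent array slots that very likely share a cache line, then in slots 128 bytes apart, assuming cache lines of at most 128 bytes. Volatile accesses via a `VarHandle` keep the JIT from hoisting the counters into registers:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class FalseSharingDemo {
    static final VarHandle ELEM = MethodHandles.arrayElementVarHandle(long[].class);
    static final long[] counters = new long[32]; // 32 longs = 256 bytes
    static final long ITERATIONS = 20_000_000L;

    public static void main(String[] args) throws InterruptedException {
        time("adjacent slots (false sharing)", 0, 1);
        time("slots 128 bytes apart (no sharing)", 0, 16);
    }

    static void time(String label, int i, int j) throws InterruptedException {
        Thread a = new Thread(() -> bump(i));
        Thread b = new Thread(() -> bump(j));
        long start = System.nanoTime();
        a.start(); b.start(); a.join(); b.join();
        System.out.printf("%s: %d ms%n", label, (System.nanoTime() - start) / 1_000_000);
    }

    static void bump(int index) {
        // Volatile read-modify-write forces real memory traffic each iteration.
        for (long n = 0; n < ITERATIONS; n++) {
            ELEM.setVolatile(counters, index, (long) ELEM.getVolatile(counters, index) + 1);
        }
    }
}
```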

Spinning

  • A processor is spinning if it is repeatedly checking whether some data in memory satisfies some condition, which wastes resources; for example:
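
A minimal sketch of a spinning thread (the flag and method names are hypothetical); `Thread.onSpinWait()` is a real Java 9+ hint that tells the processor the caller is busy-waiting:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SpinExample {
    private final AtomicBoolean ready = new AtomicBoolean(false);

    // Spinning: repeatedly re-check the flag, doing no useful work meanwhile.
    public void awaitReady() {
        while (!ready.get()) {
            Thread.onSpinWait(); // hint that this is a busy-wait loop
        }
    }

    public void publish() {
        ready.set(true); // another thread sets the flag, ending the spin
    }
}
```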

Cache-Conscious Programming, or the Puzzle Solved

  • Essentially, the TAS-Lock constantly hammers the bus with memory requests for the Boolean lock field. This traffic impedes other threads and reduces performance.
  • In contrast, the TTAS-Lock performs better than the TAS-Lock because its initial read operation is served by the L1 cache, reducing traffic on the bus.
  • Ultimately, both are spin locks, which still yield poor performance under contention; the TTAS variant is sketched below.
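
The TTAS-Lock as a Java sketch mirroring the book's TTASLock: the inner loop spins on an ordinary read, which after the first miss is served from the local L1 cache, and the expensive atomic `getAndSet` is attempted only once the lock appears free:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Test-and-test-and-set (TTAS) lock: spin on a cached read first, and only
// issue the bus-intensive atomic getAndSet when the lock looks free.
public class TTASLock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    public void lock() {
        while (true) {
            while (state.get()) { }       // spin on the locally cached value
            if (!state.getAndSet(true)) { // looked free: try to grab it
                return;
            }
        }
    }

    public void unlock() {
        state.set(false);
    }
}
```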

Multi-Core and Multi-Threaded Architectures

  • In a multi-core architecture a chip consists of multiple processors, each of which has its own L1 cache and shares an L2 cache. This setup can enable efficient communication.
  • In a multi-threaded architecture a single processor core is capable of executing multiple hardware threads, interleaving or overlapping their instructions.
  • Modern processors utilize both architectural designs, as the snippet below illustrates.
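
For instance, Java's `Runtime.availableProcessors()` typically counts logical processors, i.e., physical cores multiplied by hardware threads per core:

```java
// Typically reports logical processors: cores x hardware threads per core.
public class CoreCount {
    public static void main(String[] args) {
        System.out.println("Logical processors: "
                + Runtime.getRuntime().availableProcessors());
    }
}
```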

Relaxed Memory Consistency

  • When a processor performs a write operation the data is saved in cache and its state is set to dirty.
    • This implies that the value will eventually need to be written to main memory.
    • A more modern approach is to use a write buffer.
  • Write buffers store write operations and apply them to memory together at a later time via batching.
    • Write absorption can discard an earlier buffered value when a later write to the same address arrives before the buffer is flushed to memory.
    • Write buffers can also cause read conflicts: if buffered writes have not yet reached main memory, reads may yield stale, incorrect values.
  • The book notes that compilers can reorder instructions, which can cause inconsistencies between read and write operations across threads.
    • One attempted resolution is to guarantee limits on the extent to which reorderings can be made; the book recommends not relying on these.
    • It instead recommends memory barrier instructions, which flush the write buffer and expose its contents to other processors; a sketch of the hazard and the fix follows.
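
A hedged sketch of the hazard (the classic store-buffer example; the class name is hypothetical): with plain fields, each thread's write can sit in a write buffer past the other thread's read, so both threads may read 0. Declaring the fields `volatile` makes Java insert the necessary memory barriers:

```java
public class ReorderingDemo {
    // With plain ints, r1 == 0 && r2 == 0 is possible: each write may be
    // buffered or reordered past the other thread's read. Marking x and y
    // volatile inserts barriers that rule that outcome out.
    static volatile int x = 0, y = 0;
    static int r1, r2;

    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(() -> { x = 1; r1 = y; }); // write x, then read y
        Thread b = new Thread(() -> { y = 1; r2 = x; }); // write y, then read x
        a.start(); b.start();
        a.join();  b.join();
        System.out.println("r1=" + r1 + ", r2=" + r2); // never "r1=0, r2=0" here
    }
}
```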

Footnotes

  1. Supplemental content from a systems course; the terminology is not well grounded within the field and is often interchanged across different literature.

  2. Supplemental content from a systems course; loosely references how a system manages the abstraction of memory.

  3. Supplemental content from a systems course; reviews caches and how they fit inside the memory abstractions.

  4. Supplemental content; a Wikipedia article describing aspects of the cache hierarchy.

# ChatGPT_responses2.md

Overview

  • This file is a log of ChatGPT questions and responses for studying purposes.

Log

  1. QUESTION
    Why does a TTS-Lock use the cache but a TS-Lock does not?

    • RESPONSE
      A TTSLock (Test-and-Test-and-Set Lock) and a TSLock (Test-and-Set Lock) are both types of spin locks, which are used to coordinate access to shared resources among multiple threads or processes in a shared memory system. However, there are some key differences in how they work that can affect their behavior, including how they use the cache.

      A TTSLock uses the cache to store information about the state of the lock, including whether or not it is currently held by another thread. When a thread wants to acquire the lock, it first checks the cache to see if the lock is currently available. If it is, the thread can acquire the lock by setting a flag in the cache to indicate that it now holds the lock. This allows the TTSLock to avoid the costly operation of reading the lock's state from memory, which can improve performance.

      A TSLock, on the other hand, does not use the cache to store information about the state of the lock. Instead, it uses a single atomic instruction (such as "compare-and-swap") to directly read and write the lock's state from memory. Because the state of the lock is not stored in the cache, a TSLock does not benefit from caching and may experience more cache misses and memory accesses, so it can be slower than a TTSLock in high-contention situations.

      It's important to note that the choice of lock algorithm depends on the specific requirements of the system and the trade-offs between the different algorithms. TTSLock is useful when there is low contention and high performance is a requirement, while TSLock is useful when the contention is high, as it reduces the cache coherence protocol traffic.
