How can one understand and predict the performance of multicore algorithms?
As a first step, I'd recommend acquiring a basic understanding of the MSI protocol. Modern systems use somewhat more complex protocols, but a basic understanding of MSI should be enough to predict the most important phenomena.
Below is a further simplified model of shared memory based on the MSI protocol. While the model is not precise, and makes several simplifications not based on actual hardware, it should be accurate enough to predict many phenomena, including, for example, the false sharing performance pitfall.
Simplified MSI model of shared memory
The memory is divided into locations called cache lines. Each core has its own copy of the whole memory (all cache lines) and additionally considers each of its own cache lines to be in one of the following three states:
Modified
Shared
Invalid
State changes do not happen spontaneously. Some core needs to perform a read from or a write to a cache line.
When a cache line of a core is in the Modified state, both reads and writes are cheap. The state of the cache line needs to be Invalid in all the other cores, and neither reads nor writes require changing the state of the cache line in those other cores.
Reading a cache line in the Shared state is cheap. Writing to a cache line in the Shared state is more expensive: the state of the cache line in the core that performs the write is changed to the Modified state, and, additionally, the state of the corresponding cache line in all the other cores is changed to the Invalid state.
When a core wants to read a cache line in the Invalid state, it needs another core that has that cache line in either the Modified or the Shared state to send a copy. If the sender was in the Modified state, the sender must change its state to Shared. The reader also sets its state to Shared. Writing to an Invalid cache line also requires another core that has that cache line in the Modified or Shared state to send a copy of the cache line; then all the cores that have the cache line in the Modified or Shared state must change their state to Invalid, and the writer sets its state to Modified.
The various state changes and transmissions of cache line contents have potentially different costs. Transitions from the Invalid state to other states tend to be expensive.
While drastically simplified, this model should be accurate enough to help predict the performance of many multicore algorithms. To predict the performance of a specific algorithm relative to another, determine and count the different state changes and cache line transfers for both algorithms.
Concrete timings
This gist also includes a program, written for the Apple M1, that tries to quantify the relative costs of the various state changes. On my Apple M1 laptop I get roughly the following output from the program:
Relative to fastest:
Read M: 1.00 *
Write M: 2.58 ***
Read S: 1.04 *
Write S: 85.40 *************************************************************************************
Read Is: 2.36 **
Write Is: 31.34 *******************************
Read Im: 5.02 *****
Write Im: 32.61 *********************************
Note that the program uses constants (related to the L1 cache line size and cache size) that are specific to the Apple M1.