@joinr
Created January 16, 2024 21:26
paralleldiscuss
Outside of the bevy of good answers, I think there's some utility in discerning between concurrency and parallelism. I like Rob Pike's talk "Concurrency is not Parallelism". Clojure has great features for handling concurrency (dealing with lots of things happening at the same time) and enabling correct solutions without much effort. Those features extend into enabling parallelism (doing lots of things at the same time). I think the immutable data model and the early focus on integrated STM constructs (designed to be used with the immutable data structures) helped a lot. core.async, borrowing communicating sequential processes (CSP) from Go, added even more options for correct concurrency.
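As a minimal sketch of what those STM constructs look like in practice (the bank-transfer names here are illustrative, not from the discussion above): refs hold immutable values, and dosync coordinates updates to several refs atomically, retrying on conflict.

```clojure
;; Two refs holding immutable values; dosync makes the pair of alters atomic.
(def accounts {:a (ref 100) :b (ref 0)})

(defn transfer! [from to amount]
  (dosync
    (alter (accounts from) - amount)
    (alter (accounts to)   + amount)))

;; Many threads can call transfer! concurrently; the STM retries conflicting
;; transactions, so the total balance invariant holds.
(dorun (pmap (fn [_] (transfer! :a :b 1)) (range 50)))
;; @(:a accounts) => 50, @(:b accounts) => 50
```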
So if you can write correct concurrent programs, then you can begin to leverage system resources (e.g. cores) to distribute how work is done and maybe get some performance improvement on top (or maybe not, or maybe up to a point....it depends).
Alternately, if you have a bunch of independent work that you want to chew through as quickly as possible using all of your available resources, then Clojure still provides nice ways to handle that. Built-in facilities like pmap or pipeline from core.async provide naive ways to parallelize a workload. In theory, anywhere you use clojure.core/map to compute a sequence could be replaced with clojure.core/pmap (or a third-party alternative like the bounded pmap from claypoole) for gains. This is great because it's a trivial transformation that doesn't affect the correctness or reasoning of the (formerly) single-threaded computation.
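A rough sketch of that drop-in transformation (cpu-heavy is a stand-in workload, and the claypoole usage assumes the com.climate/claypoole dependency is on the classpath):

```clojure
;; A stand-in CPU-bound task; any pure function of one element works here.
(defn cpu-heavy [n]
  (reduce + (map #(Math/sqrt (* % %)) (range n))))

;; Single-threaded, idiomatic:
(def results  (doall (map  cpu-heavy (repeat 100 100000))))

;; Same computation, naively parallelized by swapping map for pmap:
(def presults (doall (pmap cpu-heavy (repeat 100 100000))))

;; With claypoole's bounded pmap you control the pool size explicitly:
;; (require '[com.climate.claypoole :as cp])
;; (def cresults (doall (cp/pmap 8 cpu-heavy (repeat 100 100000))))
```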
The practical downside (in my experience) is that the gains on a single machine from naive parallelism operating on idiomatic Clojure code with persistent data structures tend to hit a wall pretty quickly, well below linear scaling. A 2x improvement in throughput is pretty easy, but as you add more cores into the mix, there's often a parasitic drag induced. This is beyond obvious analysis like Amdahl's Law, since it affects workloads that are completely independent (e.g. zero synchronization; no portion of the work is explicitly single-threaded, so we would expect near-perfect linear scaling).

Personal exploration over the years (and recent dedicated professional forays) has exposed what many call the "allocation wall" (particularly in HPC literature). If you have a high-allocating workload, you will end up introducing the hidden hand of allocation (even if no GC is triggered) into the mix. Even with thread-local allocation buffers on the JVM, the threads doing independent work per core (even if they can be pegged) will still have to reach out beyond their local caches, quickly saturating the much more limited memory bandwidth. The solution (as with most highly parallelized workloads) is to keep the work as localized and independent (relative to cores and their caches) as possible. So ideally zero allocation and primitive operations (in JVM speak). Dense numerical computations conform to this description best (or at least most obviously). What would be the worst thing then? High-allocating workloads.
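To make the contrast concrete, here is an illustrative (not benchmarked) pair of workloads: the first allocates a boxed object per element as it walks a lazy seq, the second loops over a preallocated primitive array and generates essentially no per-element garbage, so parallel copies of it mostly stay inside their own caches.

```clojure
;; High-allocating: boxed Longs flowing through lazy seqs.
(defn boxed-sum [n]
  (reduce + (map inc (range n))))

;; Low-allocation: primitive doubles in a preallocated array, no boxing.
(defn primitive-sum [^doubles xs]
  (areduce xs i acc 0.0 (+ acc (aget xs i))))

;; (def xs (double-array (range 1e7)))
;; (time (boxed-sum 10000000))
;; (time (primitive-sum xs))
```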
How do clojure's persistent datastructures work? They efficiently derive a minimal path in a trie, create a new trie, copy the effected nodes in the old trie, propagate the change up the path to the root of the new trie, and return the new trie as a different immutable value (albeit one that shares almost all of its substructure with its predecessor). This was (circa 2010) a modern marvel in the FP space because you could get immutable hashtables (maps), arrays (vectors), and sets without whole copy-on-write with better performance than legacy binary tree approaches or naive whole-hog copy-on-write. The implementation ends up being surprisingly performant for many uses cases, so much so that it is common for computational processes in clojure to be entirely built on these idiomatic structures (with mutable variants being off the beaten path for performance). So....if you build a computational process in clojure using persistent data structures, and you now want distribute that out to N cores in parallel (with 0 shared data), you already start with an allocating workload (it's idiomatic). Allocation in itself isn't a problem (typically) since the JVM's gc can often do efficient "pointer-bump"s for short-lived ephemeral objects. You can ignorantly generate a ton of garbage without paying much of a penalty (from the allocation/gc point of view). This enables the idioms we know and love, which enable correctness of our programs (even concurrent ones) without an onerous sacrifice in performance (for varying definitions of onerous). It also (in my experience) vastly undercuts the ability for clojure programs to scale on a single system with multiple cores. My working estimate for most systems (including 96-core Epycs) is that the drag outweighs the added throughput such that you will probably converge on about 4x scaling as a typical limit/asymptote.
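A tiny illustration of that path-copying behavior at the REPL (values chosen arbitrarily):

```clojure
;; assoc returns a new value; the original is untouched, and most internal
;; trie nodes are shared rather than copied.
(def v1 (vec (range 1000)))
(def v2 (assoc v1 0 :changed))

(nth v1 0) ;=> 0          (old value still intact)
(nth v2 0) ;=> :changed   (new value reflects the update)
;; Only the nodes along the path from the changed leaf to the root are copied;
;; the rest of the trie is shared between v1 and v2.
```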
So if you want to parallelize computations on a single machine and use all the cores, the good news is you can do so pretty trivially with idiomatic Clojure, and you will get some benefit at no cost to correctness. You will probably be disappointed at the total benefit relative to the resources applied, though. If you want to get better scaling, you have to chase low-allocation workloads and chase the cache more (e.g. leverage thread-local variables). On Linux Epyc systems, I have observed much better scaling (~32x) on the same allocation-heavy workloads by running multiple JVMs with independent heaps, each pegged to a single thread; this is a bit of a regression in use though IMO, since it means entering into multiprocessing instead of the nice shared-memory model we have by default.
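One hedged sketch of what "chasing low allocation" can look like inside a single JVM (the function name and chunking scheme are illustrative): split a primitive array into per-thread slices so each future loops over its own region with no intermediate seqs or boxing.

```clojure
;; Sum a primitive double array in n-threads independent slices.
(defn chunked-psum [^doubles xs n-threads]
  (let [len   (alength xs)
        chunk (quot len n-threads)
        sums  (mapv (fn [t]
                      (future
                        (let [start (* t chunk)
                              end   (if (= t (dec n-threads)) len (+ start chunk))]
                          ;; tight primitive loop over this thread's slice
                          (loop [i start acc 0.0]
                            (if (< i end)
                              (recur (inc i) (+ acc (aget xs i)))
                              acc)))))
                    (range n-threads))]
    ;; combine the per-thread partial sums
    (reduce + (map deref sums))))

;; (chunked-psum (double-array (range 1e7)) 8)
```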
The surprise ending is that if we re-frame the parallelism target as parallelization by horizontal scaling (e.g. distributed computation), then Clojure seems to stretch its legs. The immutable default makes it trivial to share (and cache) data across many nodes in a cluster. If memory bandwidth per node is the limit, we can spin up more nodes (if we have the resources), each with independent memory. So one can maintain correctness and scale up performance (again, in theory....the Universal Scaling Law exists...).
[Experts, please feel free to sharpshoot the above and provide counterexamples or benchmarks.]