rrnewton/meta-par-accelerate_blog_post.lhs


> module Main where

How to write hybrid CPU/GPU programs with Haskell
-------------------------------------------------

What's better than programming a GPU with a high-level,
Haskell-embedded DSL (domain-specific language)?  Well, perhaps
writing portable CPU/GPU programs that utilize both pieces of
silicon--with dynamic load-balancing between them--would fit the
bill.

This is one of the heterogeneous programming scenarios supported by
our new **meta-par** packages.  A draft paper [can be found
here](http://www.cs.indiana.edu/~rrnewton/papers/meta-par_submission.pdf),
which explains the mechanism for building parallel schedulers out of
"mix-in" components.  In this post, however, we will skip over that
and take a look at CPU/GPU programming specifically.

This post assumes familiarity with the **monad-par** parallel
programming library, [which is described in this
paper](http://www.cs.indiana.edu/~rrnewton/papers/haskell2011_monad-par.pdf).

Getting Started
-------------------------------------------------

First, we install the just-released  [**meta-par-accelerate**
package](http://hackage.haskell.org/package/meta-par-accelerate):

    cabal install meta-par-accelerate

And then we import the following module:

> import Control.Monad.Par.Meta.AccSMP

This module provides a scheduler that combines the Accelerate GPU EDSL
with monad-par multicore CPU scheduling.  It also re-exports the
[**ParAccelerate** type
class](http://www.cs.indiana.edu/~rrnewton/haddock/abstract-par-accelerate/Control-Monad-Par-Accelerate.html#t:ParAccelerate),
which provides the ability to launch GPU computations from within a
**Par** computation.

Next, we also import Accelerate itself so that we can express **Acc**
computations that can run on the GPU:

> import Data.Array.Accelerate
> import Data.Array.Accelerate.CUDA as CUDA

(By the way, this blog post is an executable literate Haskell file
[that can be found here](GITHUB_GIST).)

Now we are ready to create a trivial Accelerate computation:

> triv :: Acc (Scalar Float)
> triv = let arr = generate (constant (Z :. (10::Int))) (\_ -> 3.3)
>        in fold (+) 0 arr

We could run this directly using CUDA, which would print out
**Array (Z) [33.0]**, Accelerate's way of saying **33.0**
(i.e. it's a zero-dimensional array):

> runDirect = print (CUDA.run triv)
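
To make clear what **triv** computes, here is a plain-list analogue that needs neither a GPU nor Accelerate. It is only a sketch: `trivList` is a hypothetical name, and ordinary lists stand in for Accelerate arrays.

```haskell
-- Plain-list analogue of triv (hypothetical name; no Accelerate involved).
-- 'generate' builds one element per index and 'fold (+) 0' sums them.
trivList :: Float
trivList = foldr (+) 0 (map (const 3.3) [0 .. 9 :: Int])
```

Evaluating `trivList` yields a value within floating-point rounding error of 33, matching the single element of the zero-dimensional array above.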

If we are instead inside a **Par** computation, we simply use **runAcc** or
**runAccWith**:

> runWPar1   = print (runPar (runAcc triv))
> runWPar2   = print (runPar (runAccWith CUDA.run triv))

The former uses the default Accelerate implementation.  The latter
specifies which Accelerate implementation to use.  After all, there
might ultimately be several -- OpenCL, CUDA, plus various CPU backends.

(In the future, we plan to add the ability to change the default
Accelerate backend either at the **runPar** site, or statically.  Stay
tuned for that.  But for now just use **runAccWith**.)

One might at this point observe that it is possible to use **CUDA.run**
directly within a **Par** computation.  This is true.  The advantage
of using **runAcc** is that it informs the **Par** scheduler of what's
going on.  The scheduler can therefore execute other work on the CPU
core that would otherwise be waiting for the GPU.

An application could achieve the same effect by creating a dedicated
thread to talk to the GPU, but that wouldn't jibe well with a pure
computation (it requires **forkIO**, and therefore **IO**), and it's
easier to let **meta-par** handle it.
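For comparison, the dedicated-thread alternative can be sketched with nothing but `base`. This is an illustration under stated assumptions, not code from **meta-par**: `fakeGpuSum` and `offloadManually` are hypothetical names, and a plain `sum` stands in for a blocking GPU call.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Hypothetical stand-in for a blocking GPU call such as CUDA.run.
fakeGpuSum :: [Float] -> Float
fakeGpuSum = sum

-- Run the "GPU" call on a dedicated thread while this thread does CPU work.
-- The result lives in IO, which is what a pure Par computation avoids.
offloadManually :: [Float] -> [Int] -> IO (Float, Int)
offloadManually gpuInput cpuInput = do
  box <- newEmptyMVar
  -- ($!) forces the sum on the worker thread before it is handed back.
  _ <- forkIO (putMVar box $! fakeGpuSum gpuInput)
  let cpuResult = sum cpuInput
  -- Force the CPU result first, then block until the worker has finished.
  gpuResult <- cpuResult `seq` takeMVar box
  return (gpuResult, cpuResult)
```

Calling `offloadManually [1, 1, 1, 1] [1 .. 10]` returns both results, but the caller is now stuck in `IO` and has to manage the thread and the `MVar` by hand.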

The second benefit of integrated scheduling is that the scheduler can
automatically divide work between the CPU and GPU.  Eventually, when
there are [full-featured, efficient CPU-backends for
Accelerate](https://github.com/HIPERFIT/accelerate-opencl), this will
happen transparently.  For now you need to use **unsafeHybrid**,
described in the next section.

Finally, our [forthcoming CPU/GPU/Distributed
schedulers](https://github.com/simonmar/monad-par/tree/4332a2dc6fab7ccdb702ad5b285e052f62b43c14/meta-par-dist-tcp)
can make more intelligent decisions if they know where all the calls
to GPU computations occur.


Hybrid CPU/GPU workloads
-------------------------------------------------

The [meta-par](http://hackage.haskell.org/package/meta-par) and
[meta-par-accelerate](http://hackage.haskell.org/package/meta-par-accelerate)
packages, as currently released, include a generalized work-stealing
infrastructure.

The relevant point for our purposes here is that the CPU and GPU can
each steal work from one another.  Work-stealing is by no means the
most sophisticated CPU/GPU partitioning on the scene.  Much literature
has been written on the subject, and the techniques can get quite
elaborate (for example, modeling memory transfer time).  However, as on
regular multicores, work-stealing provides an admirable combination of
simplicity and efficacy.  For example, if a given program runs much
better on one device than on the other, then that device will end
up doing more of the work.
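The load-balancing effect can be illustrated with a tiny pure simulation. This is a deliberately simplified greedy model, not **meta-par**'s work-stealing scheduler: whichever device becomes free first takes the next task, every task has a fixed cost per device, and `greedySplit` is a hypothetical name.

```haskell
-- Greedy model: the device that is free earliest takes the next task.
-- Arguments: cost of one task on the CPU, cost on the GPU, number of tasks.
-- Returns (tasks done by the CPU, tasks done by the GPU).
greedySplit :: Int -> Int -> Int -> (Int, Int)
greedySplit cpuCost gpuCost nTasks = go nTasks 0 0 0 0
  where
    go n cpuFree gpuFree c g
      | n <= 0            = (c, g)
      -- The CPU is idle strictly earlier, so it takes the task.
      | cpuFree < gpuFree = go (n - 1) (cpuFree + cpuCost) gpuFree (c + 1) g
      -- Otherwise (including ties) the GPU takes it.
      | otherwise         = go (n - 1) cpuFree (gpuFree + gpuCost) c (g + 1)
```

With ten tasks that cost 4 time units on the CPU and 1 on the GPU, `greedySplit 4 1 10` assigns 2 tasks to the CPU and 8 to the GPU; swapping the costs swaps the split.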

In the current release, we use [unsafeHybridWith, documented
here](http://www.cs.indiana.edu/~rrnewton/haddock/abstract-par-accelerate/Control-Monad-Par-Accelerate.html#v:unsafeHybrid),
to spawn a task with two separate implementations--one CPU and one
GPU--leaving the scheduler to choose between them.  Here's a silly
example:

> hybrid :: Par (IVar Float)
> hybrid = unsafeHybridWith CUDA.run (`indexArray` Z) (return 33.0, triv)
> runHybrid = print (runPar (hybrid >>= get))

The call to **unsafeHybridWith** is passed a task that consists of a
separate CPU component (**return 33.0**) and GPU component (**triv**).
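The shape of that call can be modeled in plain Haskell. The sketch below is hypothetical and is not the library's implementation: `Device` and `hybridModel` are invented names, and the device choice is passed in explicitly, whereas the real scheduler decides at run time.

```haskell
-- Which device the (imaginary) scheduler picked for this task.
data Device = CPU | GPU

-- Hypothetical model of the argument shape used by unsafeHybridWith above:
-- a backend runner, a conversion from the GPU result, and a (CPU, GPU) pair.
-- Nothing checks that the two versions agree, hence "unsafe".
hybridModel :: Device -> (gpuProg -> gpuOut) -> (gpuOut -> a) -> (a, gpuProg) -> a
hybridModel CPU _   _       (cpuVersion, _) = cpuVersion
hybridModel GPU run convert (_, gpuVersion) = convert (run gpuVersion)
```

Only one component of the pair is ever evaluated for a given task, so the caller is responsible for making both components compute the same answer.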


Further generalizations
-----------------------------

 * ParAccelerate

Actually, thanks to GHC's support for [Constraint Kinds](URL_TODO), it
is possible to generalize this even further, abstracting over not
just Accelerate implementations but over different kinds of EDSLs thata

 * ParOffChip

Note that those types are general enough to encapsulate (**Arrays**
constraint, **Acc** type) and CloudHaskell-style remote calls
(**Serializable** constraint, **Closure** type).
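As a rough illustration of how Constraint Kinds allow this kind of abstraction, the sketch below lets each "off-chip" language choose both its payload constraint and its computation type. The class `OffChip`, its members, and the `Boxed` instance are hypothetical and are not the library's actual definitions.

```haskell
{-# LANGUAGE ConstraintKinds #-}
{-# LANGUAGE TypeFamilies #-}

import Data.Kind (Constraint)

-- Hypothetical class, not the library's ParOffChip: each "off-chip" language
-- chooses its own constraint on payloads and its own computation type.
class OffChip dsl where
  type Payload dsl a :: Constraint
  runOffChip :: Payload dsl a => dsl a -> a

-- Toy instance standing in for an (Arrays constraint, Acc type) style pairing:
-- here the constraint is Show and the computation type is a simple wrapper.
newtype Boxed a = Boxed a

instance OffChip Boxed where
  type Payload Boxed a = Show a
  runOffChip (Boxed x) = x
```

A second instance could pick a serialization constraint and a closure-like wrapper instead, which is the sense in which one interface can cover both GPU and remote calls.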

However, we haven't yet seen a strong motivation for generalizing the
interface to this extent.  (And there's always the danger that the
interface becomes difficult to use due to ambiguity errors from the
type checker.)  If you have one, let us know!


Notes for Hackers
-----------------------------

If you want to work with the GitHub repositories, you need to have GHC
7.4 and the latest cabal-install (0.14.0).  You can check everything
out here:

    git clone git://github.com/simonmar/monad-par.git --recursive

Try **make mega-install-gpu** if you already have CUDA installed on your machine.


Appendix: Documentation Links
-----------------------------


 * [accelerate-cuda](http://www.cs.indiana.edu/~rrnewton/haddock/accelerate-cuda/)


> main = do putStrLn "hi"
>           runDirect
>           runWPar1
>           runWPar2
> tmp :: Par (Scalar Float)
> tmp = runAcc triv

