meta-par-accelerate blog post
> module Main where
How to write hybrid CPU/GPU programs with Haskell
What's better than programming a GPU with a high-level,
Haskell-embedded DSL (domain-specific language)? Well, perhaps
writing portable CPU/GPU programs that utilize both pieces of
silicon--with dynamic load-balancing between them--would fit the bill.
This is one of the heterogeneous programming scenarios supported by
our new **meta-par** packages. A draft paper [can be found
which explains the mechanism for building parallel schedulers out of
"mix-in" components. In this post, however, we will skip over that
and take a look at CPU/GPU programming specifically.
This post assumes familiarity with the **monad-par** parallel
programming library, [which is described in this
Getting Started
First, we install the just-released **meta-par-accelerate** package:
cabal install meta-par-accelerate
And then we import the following module:
> import Control.Monad.Par.Meta.AccSMP
This provides a scheduler that combines the Accelerate GPU EDSL with
monad-par multicore CPU scheduling. It also reexports **ParAccelerate**,
which provides the ability to launch GPU computations from within a
**Par** computation.
Next, we also import Accelerate itself so that we can express **Acc**
computations that can run on the GPU:
> import Data.Array.Accelerate
> import Data.Array.Accelerate.CUDA as CUDA
(By the way, this blog post is an executable literate Haskell file
[that can be found here](GITHUB_GIST).)
Now we are ready to create a trivial Accelerate computation:
> triv :: Acc (Scalar Float)
> triv = let arr = generate (constant (Z :. (10::Int))) (\ _ -> 3.3)
>        in fold (+) 0 arr
We could run this directly using CUDA, which would print out
**Array (Z) [33.0]**, Accelerate's way of saying **33.0**
(i.e. it's a zero-dimensional array):
> runDirect = print (CUDA.run triv)
If we are instead inside a Par computation, we simply use **runAcc** or **runAccWith**:
> runWPar1 = print (runPar (runAcc triv))
> runWPar2 = print (runPar (runAccWith CUDA.run triv))
The former uses the default Accelerate implementation. The latter
specifies which Accelerate implementation to use. After all, there
might ultimately be several -- OpenCL, CUDA, plus various CPU backends.
(In the future, we plan to add the ability to change the default
Accelerate backend either at the **runPar** site, or statically. Stay
tuned for that. But for now just use **runAccWith**.)
One might at this point observe that it is possible to use **CUDA.run**
directly within a **Par** computation. This is true. The advantage
of using **runAcc** is that it informs the **Par** scheduler of what's
going on. The scheduler can therefore execute other work on the CPU
core that would otherwise be waiting for the GPU.
An application could achieve the same effect by creating a dedicated
thread to talk to the GPU, but that wouldn't jibe well with a pure
computation (**forkIO** lives in IO), and it's easier to let **meta-par** handle it.
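
For comparison, here is a minimal sketch of the hand-rolled "dedicated GPU thread" approach mentioned above. It uses only `base`; the "GPU job" is an ordinary `IO Float` stand-in, and the names `Request`, `startGpuThread`, `submit`, and `overlapDemo` are hypothetical, not part of **meta-par**. It is a standalone file rather than part of this literate program. Note that everything ends up in `IO`, which is the awkwardness the integrated scheduler avoids.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forever)

-- | A request pairs a stand-in "GPU job" with a slot for its result.
data Request = Request (IO Float) (MVar Float)

-- | Start a dedicated thread that serves requests one at a time.
startGpuThread :: IO (Chan Request)
startGpuThread = do
  chan <- newChan
  _ <- forkIO $ forever $ do
    Request job slot <- readChan chan
    result <- job
    putMVar slot result
  return chan

-- | Submit a job without blocking; the caller can do CPU work before waiting.
submit :: Chan Request -> IO Float -> IO (MVar Float)
submit chan job = do
  slot <- newEmptyMVar
  writeChan chan (Request job slot)
  return slot

-- | Overlap a (fake) GPU job with some CPU work, then combine the results.
overlapDemo :: IO Float
overlapDemo = do
  chan <- startGpuThread
  slot <- submit chan (return (sum (replicate 10 3.3)))
  let cpuPart = fromIntegral (length [1 .. 1000 :: Int])  -- CPU-side work
  gpuPart <- takeMVar slot                                -- wait for the "GPU"
  return (cpuPart + gpuPart)
```

The caller has to thread the channel around and decide by hand when to block, which is the bookkeeping that **runAcc** hands to the scheduler.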
The second benefit of integrated scheduling is that the scheduler can
automatically divide work between the CPU and GPU. Eventually, when
there are full-featured, efficient CPU backends for
Accelerate, this will
happen transparently. For now you need to use **unsafeHybrid**,
described in the next section.
Finally, our soon-forthcoming CPU/GPU/distributed schedulers
can make more intelligent decisions if they know where all the calls
to GPU computations occur.
Hybrid CPU/GPU workloads.
The [meta-par]( and
packages, as currently released, include a generalized work-stealing
The relevant point for our purposes here is that the CPU and GPU can
each steal work from one another. Work-stealing is by no means the
most sophisticated CPU/GPU partitioning on the scene. Much literature
has been written on the subject, and it can get quite sophisticated
(for example, modeling memory transfer time). However, as on regular
multicores, work-stealing provides an admirable combination of
simplicity and efficacy. For example, if a given program runs much
better on the CPU or the GPU respectively, then that device will end
up doing more of the work.
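
To make that last claim concrete, here is a toy, purely sequential model (a sketch, not the **meta-par** scheduler). It assumes identical tasks in one shared pool, with whichever device becomes idle first taking the next task, which is a simplification of real work stealing. The `Costs` type, the cost values, and `simulate` are all hypothetical.

```haskell
import Data.List (foldl')

-- | Per-task cost, in arbitrary time units, on each device (assumed values).
data Costs = Costs { cpuCost :: Int, gpuCost :: Int }

-- | Simulate n identical tasks pulled greedily from a shared pool.
-- The fold state is (time CPU is next free, time GPU is next free,
-- tasks done by CPU, tasks done by GPU).
simulate :: Costs -> Int -> (Int, Int)
simulate (Costs c g) n = (cpuDone, gpuDone)
  where
    (_, _, cpuDone, gpuDone) = foldl' step (0, 0, 0, 0) [1 .. n]
    step :: (Int, Int, Int, Int) -> Int -> (Int, Int, Int, Int)
    step (tc, tg, nc, ng) _
      | tc <= tg  = (tc + c, tg, nc + 1, ng)  -- CPU is idle first: it takes the task
      | otherwise = (tc, tg + g, nc, ng + 1)  -- GPU is idle first
```

With `Costs 4 1` (the GPU four times faster per task), `simulate (Costs 4 1) 100` yields `(20, 80)`: the faster device ends up with four fifths of the work without any explicit partitioning decision.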
In the current release, we use **unsafeHybridWith**
to spawn a task with two separate implementations--one CPU and one
GPU--leaving the scheduler to choose between them. Here's a silly example:
> hybrid :: Par (IVar Float)
> hybrid = unsafeHybridWith CUDA.run (`indexArray` Z) (return 33.0, triv)
> runHybrid = print (runPar (hybrid >>= get))
The call to **unsafeHybridWith** is passed a task that consists of a
separate CPU (**return 33.0**) and GPU (**triv**) component.
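
The shape of such a two-implementation task can be modeled without any GPU at all. The sketch below is a standalone illustration using only `base`; `Device`, `HybridTask`, `runHybridOn`, and `firstElem` are hypothetical names, and a one-element list stands in for Accelerate's zero-dimensional array. It shows why a conversion function (the role played by `indexArray` above) is needed: the GPU side returns a device-style result that must be turned into the type the CPU side produces.

```haskell
-- | Which device a (hypothetical) scheduler picked for the task.
data Device = CPU | GPU deriving (Eq, Show)

-- | A task with a CPU implementation producing an 'a' directly and a
-- GPU implementation producing some device-side result 'b'.
data HybridTask a b = HybridTask { onCPU :: a, onGPU :: b }

-- | Run the task on the chosen device, converting the GPU result if needed.
runHybridOn :: Device -> (b -> a) -> HybridTask a b -> a
runHybridOn CPU _       task = onCPU task
runHybridOn GPU convert task = convert (onGPU task)

-- | Stand-in for extracting the single element of a scalar array.
firstElem :: [Float] -> Float
firstElem (x : _) = x
firstElem []      = 0

-- | The analogue of the silly example: both sides should yield 33.0.
sillyTask :: HybridTask Float [Float]
sillyTask = HybridTask { onCPU = 33.0, onGPU = [33.0] }
```

Nothing in this model checks that the two implementations agree; keeping them consistent is the caller's responsibility, which is presumably what the "unsafe" in the name warns about.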
Further generalizations
* ParAccelerate
Actually, thanks to GHC's support for [Constraint Kinds](URL_TODO), it
is possible to generalize this even further, abstracting over not
just Accelerate implementations but over different kinds of EDSLs thata
* ParOffChip
Note that those types are general enough to encapsulate (**Arrays**
constraint, **Acc** type) and CloudHaskell-style remote calls
(**Serializable** constraint, **Closure** type).
That said, we haven't yet seen a strong motivation for generalizing the
interface to this extent. (And there's always the danger that the
interface becomes difficult to use due to ambiguity errors from the
type checker.) If you have one, let us know!
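
As a rough illustration of the kind of abstraction Constraint Kinds allows, here is a standalone sketch in which a class is parameterized over both the "code" type and the constraint its results must satisfy. This is not the interface from the packages: `OffChip`, `Payload`, `runOffChip`, `Logged`, and `Plain` are hypothetical names, with `Logged`/`Show` standing in for a pairing like **Acc**/**Arrays** or **Closure**/**Serializable**.

```haskell
{-# LANGUAGE ConstraintKinds, KindSignatures, TypeFamilies #-}
import Data.Kind (Constraint, Type)

-- | Hypothetical class: 'code' describes off-chip work, and 'Payload'
-- is the constraint that its results must satisfy.
class OffChip (code :: Type -> Type) where
  type Payload code (a :: Type) :: Constraint
  runOffChip :: Payload code a => code a -> IO a

-- | Stand-in backend whose results must be showable.
newtype Logged a = Logged a

instance OffChip Logged where
  type Payload Logged a = Show a
  runOffChip (Logged x) = do
    putStrLn ("offloading " ++ show x)
    return x

-- | Stand-in backend that places no constraint on its results.
newtype Plain a = Plain a

instance OffChip Plain where
  type Payload Plain a = ()
  runOffChip (Plain x) = return x
```

Each instance picks its own constraint, so a caller of `runOffChip` only has to satisfy whatever the chosen backend demands. A class like this can also lead to the ambiguity errors mentioned above when the backend type cannot be inferred from context.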
Notes for Hackers
If you want to work with the github repositories, you need to have GHC
7.4 and the latest cabal-install (0.14.0). You can check everything
out here:
git clone git:// --recursive
Try **make mega-install-gpu** if you already have CUDA installed on your machine.
Appendix: Documentation Links
* accelerate-cuda
> main = do putStrLn "hi"
>           runDirect
>           runWPar1
>           runWPar2
> tmp :: Par (Scalar Float)
> tmp = runAcc triv