athas/presentation.md

## presentation.md

      
    Raw
  

              presentation.md
            
          
  title
  author
  date
  patat
  
  
  Futhark Presentation for the /r/ProgrammingLanguages meetup
  Troels Henriksen
  November 22nd, 2020
  
  
  theme
  
  
  codeBlock
  header
  emph
  
  
  bold
  
  
  bold
  
  
 _____      _   _                _
|  ___|   _| |_| |__   __ _ _ __| | __
| |_ | | | | __| '_ \ / _` | '__| |/ /
|  _|| |_| | |_| | | | (_| | |  |   <
|_|   \__,_|\__|_| |_|\__,_|_|  |_|\_\

       22nd of November, 2020


Troels Henriksen (and many others)


DIKU - University of Copenhagen


futhark-lang.org


The Need for Speed

We need performance more than ever!


But modern parallel machines (e.g. GPUs) are difficult to program.

We need to program in parallel!


But human brains are really poor at reasoning about massive
concurrency.

Could we do nothing?

Could a sufficiently smart compiler automagically make our old code
parallel?  Probably not.
Is this parallel?

for (int i = 0; i < n; i++) {
  ys[i] = f(xs[i]);
}

. . .
Yes, but requires the compiler to perform index analysis to see it.
What about this one?

for (int i = 0; i < n; i++) {
  ys[i+1] = f(ys[i], xs[i]);
}

. . .
Yes, but I would be surprised if any parallelising compiler could
detect it.
Write what you mean!

We express algorithms that have plenty of parallelism with a
sequential vocabulary.
. . .
What if we improved our vocabulary?

for (int i = 0; i < n; i++) {
  ys[i] = f(xs[i]);            ⇨   let ys = map f xs
}

for (int i = 0; i < n; i++) {
  ys[i+1] = f(ys[i], xs[i]);   ⇨   let ys = scan f xs
}

The existing functional programming vocabulary is almost what we
need.
There is hope!


Functional programming provides a programming model that is in
principle parallel.


The trick is merely how to generate code that is also fast in
practice.


Win some, lose some


We need a restricted functional language that omits complex
features that are hard to implement efficiently...


...but is still flexible enough to write the programs we care
about.


Futhark!

A Language


Least common denominator purely functional hardware-agnostic
language with parallel constructs.


Practical performance is the main goal, which carries many
restrictions.


Syntax is a mix of Haskell and SML to ensure equal dislike by
everybody.


. . .
A Compiler


Fusion, moderate flattening, and many other optimisations.


Focuses on efficient common case for now.


Generates GPU-optimised code via OpenCL or CUDA.


... or multicore CPU code with pthreads.


Examples

Example: Dot Product

. . .
let dotprod [n] (xs: [n]i32) (ys: [n]i32): i32 =
  reduce (+) 0 (map2 (*) xs ys)

Example: Matrix Multiplication

. . .
let dotprod [n] (xs: [n]i32) (ys: [n]i32): i32 =
  reduce (+) 0 (map2 (*) xs ys)

let matmul [n][m][p] (a: [n][m]i32) (b: [m][p]i32): [n][p]i32 =
  map (\a_row ->
         map (\a_col ->
                dotprod a_row a_col)
             (transpose b))
      a

Goals

Futhark's goal in one sentence

Be faster than everything that is more flexible, and more flexible
than everything that is faster.
. . .
Not a goal

Taking over the world!
. . .
Futhark is not all-or-nothing


We don't want to replace all languages.  (Nor could we.)


Futhark is for small performance-sensitive computational
kernels.  The rest of the application can remain unchanged.


Performance is a feature that makes all other features harder to implement

Higher-order functions are problematic


Normally implemented via function pointers.


GPUs do not (efficiently) support function pointers.


Indirect calls are slow even on CPUs.


. . .
Fortunately, the 70s were full of people who did not like function
pointers either.
Defunctionalisation by John Reynolds (1972)

Basic idea:


Replace each lambda by a tagged data value that captures the free
variables:
  \x -> x + y
    ⇨ Lam0 y

  \x -> z * x
    ⇨ Lam1 z


. . .


Replace function calls by case-switching over these funcstions
  f a
  ⇨ case f of Lam0 y -> a + y
              Lam1 z -> z + a
              ...


. . .
Unfortunately, branches are also problematic!
Ensuring branch-free defunctionalisation

Type restrictions to ensure we always know which function is being called.
. . .
let f = if b1 then \x -> foo
        else if b2 then \x -> bar
        else ... \x -> baz
in... f y


Which function f is applied?


So we ban conditionals from returning functions.


Arrays may also not contain functions

let fs = [\y -> y+a, \z -> z*b, ...]
in... fs[i] 5


Which function fs[i] is applied?

Example of defunctionalisation

Original program:
let a = 1
let b = 2
let f = \x -> x+a
in f b

. . .
Defunctionalised:
let a = 1
let b = 2
let f = {a=a} -- record that captures free variables
in f' f b

with lifted function
let f' env x =
  let a = env.a
  in x+a

The point


Restricting the language enables better code generation.


Crucial: the restrictions are easy to understand, checked by
the type checker, and usually not a hindrance in practice


Let's talk about value representation

. . .

Futhark unboxes everything
Everything is call-by-value (except arrays)

A triple (a,b,c) is treated as three distinct values, kept in
registers.
Arrays of tuples


Consider arrays of type [](i32, i8).
Since an i32 is four bytes and a i8 is one byte, how should
Futhark store this in memory?

. . .
0          4    5          9    10...
┌──────────┬────┬──────────┬────┬─
│    i32   │ i8 │    i32   │ i8 │...
└──────────┴────┴──────────┴────┴─

. . .
Problem: Unaligned access.
. . .
0          4          8          12         16
┌──────────┬──────────┬──────────┬──────────┬─
│    i32   │    i8    │    i32   │    i8    │...
└──────────┴──────────┴──────────┴──────────┴─

Problem: Waste of memory.
Tuples of arrays

The Futhark compiler represents an array of type [n](t1, t2, t3...)
as multiple arrays of types [n]t1, [n]t2, [n]t3...
0          4          8           12          16
┌──────────┬───────────┬──────────┬───────────┬─
│    i32   │    i32    │    i32   │    i32    │...
└──────────┴───────────┴──────────┴───────────┴─

0    1    2    3    4    5    6    7    8    9
┌─────────┬────┬────┬────┬────┬────┬────┬────┬─
│ i8 │ i8 │ i8 │ i8 │ i8 │ i8 │ i8 │ i8 │ i8 │...
└────┴────┴────┴────┴────┴────┴────┴────┴────┴─


Common (and crucial) transformation.
Called "struct of arrays" in legacy languages.
Automatically done by the Futhark compiler.
Only affects internal language.

Conclusions

Conclusions


Writing a small and focused language is fun.


Designing a language for performance requires you to think hard
about what the language permits.


Lots more on our dev blog, particularly:


https://futhark-lang.org/blog/2017-12-27-reflections-on-a-phd-accidentally-spent-on-language-design.html


https://futhark-lang.org/blog/2018-06-18-designing-a-programming-language-for-the-desert.html


https://futhark-lang.org/blog/2016-09-03-language-design.html