Explicit array fusion
#!/usr/bin/env stack
-- stack --resolver lts-8.3 exec --package vector --package criterion -- ghc -O2
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE ScopedTypeVariables #-}
module Main where
import Control.Monad.ST (ST)
import Criterion.Main
import qualified Data.Vector.Unboxed as VU
import qualified Data.Vector.Unboxed.Mutable as MVU
-- | Delayed Array.
data DArray ix e = DArray { dSize        :: !ix
                          , dUnsafeIndex :: ix -> e }
-- | Manifest Array.
data MArray ix e = MArray { mSize        :: !ix
                          , mUnsafeIndex :: ix -> e }
-- | Efficient loop with an accumulator
loop :: Int -> (Int -> Bool) -> (Int -> Int) -> a -> (Int -> a -> a) -> a
loop !init' condition increment !initAcc f = go init' initAcc where
  go !step !acc =
    case condition step of
      False -> acc
      True  -> go (increment step) (f step acc)
{-# INLINE loop #-}
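-- Hypothetical usage sketch (added for illustration, not in the original
-- gist): sum the squares of the first n indices with 'loop'; the strict
-- accumulator keeps the fold in constant space.
sumSquares :: Int -> Int
sumSquares n = loop 0 (< n) (+ 1) 0 $ \ !i !acc -> acc + i * i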
-- | Efficient monadic loop
loopM_ :: Monad m => Int -> (Int -> Bool) -> (Int -> Int) -> (Int -> m a) -> m ()
loopM_ !init' condition increment f = go init' where
  go !step =
    case condition step of
      False -> return ()
      True  -> f step >> go (increment step)
{-# INLINE loopM_ #-}
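-- Hypothetical usage sketch (added for illustration, not in the original
-- gist): 'loopM_' works in any monad; here it prints every even index below
-- n in IO, while 'compute' below uses it in ST to fill a mutable vector.
printEvens :: Int -> IO ()
printEvens n = loopM_ 0 (< n) (+ 2) print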
-- | Load a delayed array into memory. Unlike stream fusion, this can also be parallelized.
compute
  :: forall e . VU.Unbox e => DArray Int e -> MArray Int e
compute (DArray k f) = MArray k (VU.unsafeIndex v)
  where
    -- Make sure the vector is fully evaluated (for an unboxed vector, seq is
    -- as good as deepseq).
    !v = VU.create generateArray
    generateArray :: ST s (MVU.MVector s e)
    generateArray = do
      mv <- MVU.unsafeNew k
      loopM_ 0 (< k) (+ 1) $ \ !i -> MVU.unsafeWrite mv i (f i)
      return mv
    {-# INLINE generateArray #-}
{-# INLINE compute #-}
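-- Hypothetical sketch (added for illustration, not in the original gist):
-- forcing a delayed array directly. 'compute' walks the index space once and
-- writes each element into a freshly allocated unboxed vector, so the delayed
-- indexing function is evaluated exactly once per index.
doublesManifest :: Int -> MArray Int Int
doublesManifest n = compute (DArray n (* 2))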
makeArray1D :: Int -> (Int -> e) -> DArray Int e
makeArray1D = DArray
{-# INLINE makeArray1D #-}
mapA :: (b -> e) -> DArray ix b -> DArray ix e
mapA f (DArray k g) = DArray k (f . g)
{-# INLINE mapA #-}
foldlA :: (t1 -> t -> t1) -> t1 -> DArray Int t -> t1
foldlA f !acc (DArray k g) = loop 0 (< k) (+ 1) acc $ \ !i !acc0 -> f acc0 (g i)
{-# INLINE foldlA #-}
sumA :: DArray Int Int -> Int
sumA = foldlA (+) 0
{-# INLINE sumA #-}
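-- Hypothetical sketch (added for illustration, not in the original gist):
-- the explicit fusion this gist demonstrates. 'mapA' merely composes
-- functions inside the delayed array, so folding the result traverses each
-- index once and allocates no intermediate vector.
sumOfDoubles :: Int -> Int
sumOfDoubles n = sumA (mapA (* 2) (makeArray1D n id))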
main :: IO ()
main = do
  let !sz = 640000
  let arr1D = (`makeArray1D` id)
  let vec = (`VU.generate` id)
      {-# INLINE vec #-} -- <-- Degraded performance if not inlined.
  let range n = VU.enumFromN 0 n :: VU.Vector Int
  defaultMain
    [ bgroup
        "Fold Only"
        [ bench "1D Array" $ whnf (sumA . arr1D) sz
        , bench "Vector Unboxed" $ whnf (VU.sum . vec) sz
        , bench "Vector Unboxed enumFromN" $ whnf (VU.sum . range) sz
        ]
    , bgroup
        "Fold Fused"
        [ bench "1D Array" $ whnf (sumA . mapA (+ 25) . arr1D) sz
        , bench "Vector Unboxed" $ whnf (VU.sum . VU.map (+ 25) . vec) sz
        ]
    , bgroup
        "Compute Only"
        [ bench "1D Array" $ whnf (compute . arr1D) sz
        , bench "Vector Unboxed" $ whnf (`VU.generate` id) sz
        , bench "Vector Unboxed (very slow - not inlined!?!?)" $ whnf vec sz
        ]
    , bgroup
        "Compute Fused"
        [ bench "1D Array" $ whnf (compute . mapA (+ 50) . arr1D) sz
        , bench "Vector Unboxed" $ whnf (VU.map (+ 50) . vec) sz
        ]
    ]
lehins commented Mar 28, 2017

Folding performance is very close for the two approaches, and both run in constant memory. The exception is VU.enumFromN, which for some reason is about 3 times slower.

$ ./array --match prefix "Fold Only"
benchmarking Fold Only/1D Array
time                 356.6 μs   (354.2 μs .. 358.7 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 360.5 μs   (359.4 μs .. 361.6 μs)
std dev              3.663 μs   (3.126 μs .. 4.826 μs)

benchmarking Fold Only/Vector Unboxed
time                 358.1 μs   (356.7 μs .. 359.2 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 357.0 μs   (356.0 μs .. 358.3 μs)
std dev              3.966 μs   (3.301 μs .. 5.747 μs)

benchmarking Fold Only/Vector Unboxed enumFromN
time                 1.039 ms   (1.020 ms .. 1.059 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 1.024 ms   (1.020 ms .. 1.030 ms)
std dev              17.88 μs   (12.24 μs .. 27.72 μs)
$ ./array --match prefix "Fold Fused"
benchmarking Fold Fused/1D Array
time                 397.7 μs   (396.1 μs .. 399.2 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 401.4 μs   (400.4 μs .. 402.5 μs)
std dev              3.780 μs   (2.859 μs .. 5.278 μs)

benchmarking Fold Fused/Vector Unboxed
time                 387.2 μs   (384.2 μs .. 391.8 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 386.3 μs   (384.7 μs .. 388.1 μs)
std dev              5.686 μs   (4.181 μs .. 9.062 μs)

Loading values into memory, on the other hand, is noticeably slower with VU.generate: roughly 567 μs versus 377 μs for the delayed array, i.e. about 50% more time.

$ ./array --match prefix "Compute Only"
benchmarking Compute Only/1D Array
time                 377.3 μs   (375.6 μs .. 378.9 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 383.5 μs   (382.5 μs .. 384.8 μs)
std dev              3.894 μs   (3.128 μs .. 5.224 μs)

benchmarking Compute Only/Vector Unboxed
time                 566.9 μs   (565.4 μs .. 569.0 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 568.5 μs   (567.0 μs .. 570.8 μs)
std dev              5.999 μs   (3.744 μs .. 8.349 μs)

benchmarking Compute Only/Vector Unboxed (very slow - not inlined!?!?)
time                 3.967 ms   (3.949 ms .. 3.990 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 3.983 ms   (3.965 ms .. 4.046 ms)
std dev              94.27 μs   (31.36 μs .. 190.6 μs)


$ ./array --match prefix "Compute Fused"
benchmarking Compute Fused/1D Array
time                 368.8 μs   (366.5 μs .. 370.8 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 372.6 μs   (371.3 μs .. 373.7 μs)
std dev              4.494 μs   (3.700 μs .. 5.449 μs)

benchmarking Compute Fused/Vector Unboxed
time                 552.3 μs   (551.3 μs .. 553.7 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 555.2 μs   (553.6 μs .. 557.1 μs)
std dev              5.738 μs   (4.804 μs .. 6.763 μs)
